HEWLETT-PACKARD 

JOURNAL 



February 1996 







In Memoriam 




On November 23, 1995, Hewlett-Packard lost a close friend and a consistent source of
inspiration with the death of Barney Oliver. Barney was a towering figure in HP's
history, but he was especially close to HP Laboratories, the organization he founded and
directed for almost 25 years. Some of his technical contributions were described in
more than a dozen HP Journal articles. But his technical achievements were just part
of the legacy he leaves. No one who ever worked with or knew Barney can forget his
insistence on excellence and innovation. He was a great intellect and a master
inventor whose interests and expertise spanned many scientific disciplines. He will be
deeply missed.






©Copr. 1949-1998 Hewlett-Packard Co.



HEWLETT-PACKARD 

JOURNAL 



February 1996 Volume 47 • Number 1



Articles 



8    Symmetric Multiprocessing Workstations and Servers System-Designed for High
     Performance and Low Cost, by Matt J. Harline, Brendan A. Voge, Loren P. Staley, and
     Badir M. Mousa

     K-Class Power System

18   A High-Performance, Low-Cost Multiprocessor Bus for Workstations and Midrange Servers,
     by William R. Bryg, Kenneth K. Chan, and Nicholas S. Fiduccia

     Runway Bus Electrical Design Considerations

25   Design of the HP PA 7200 CPU, by Kenneth K. Chan, Cyrus C. Hay, John R. Keller, Gordon P.
     Kurpanek, Francis X. Schumacher, and Jason Zheng

34   Verification, Characterization, and Debugging of the HP PA 7200 Processor, by Thomas B.
     Alexander, Kent A. Dickey, David N. Goldberg, Ross V. La Fetra, James R. McGee, Nazeem
     Noordeen, and Akshya Prakash

44   A New Memory System Design for Commercial and Technical Computing Products,
     by Thomas R. Hotchkiss, Norman D. Marschke, and Richard M. McClosky

52   Hardware Cache Coherent Input/Output, by Todd J. Kjos, Helen Nusbaum, Michael K. Traynor,
     and Brendan A. Voge



Executive Editor, Steve Beitler • Managing Editor, Charles L. Leath • Senior Editor, Richard P. Dolan • Assistant Editor, Robin Everest
Publication Production Manager, Susan E. Wright • Illustration, Renée D. Pighini • Typography/Layout, John Nicoara

Advisory Board: Rajeev Badyal, Integrated Circuit Business Division, Fort Collins, Colorado • Thomas Beecher, Open Systems Software Division,
Chelmsford, Massachusetts • Steven Brittenham, Disk Memory Division, Boise, Idaho • William W. Brown, Integrated Circuit Business Division, Santa
Clara, California • Rajesh Desai, Commercial Systems Division, Cupertino, California • Kevin G. Ewert, Integrated Systems Division, Sunnyvale, California •
Bernhard Fischer, Böblingen Medical Division, Böblingen, Germany • Douglas Gennetten, Greeley Hardcopy Division, Greeley, Colorado • Gary Gordon,
HP Laboratories, Palo Alto, California • Mark Gorzynski, InkJet Supplies Business Unit, Corvallis, Oregon • Matt J. Harline, Systems Technology Division,
Roseville, California • Kiyoyasu Hiwada, Hachioji Semiconductor Test Division, Tokyo, Japan • Bryan Hoog, Lake Stevens Instrument Division, Everett,
Washington • C. Steven Joiner, Optical Communication Division, San Jose, California • Roger L. Jungerman, Microwave Technology Division, Santa Rosa,
California • Forrest Kellert, Microwave Technology Division, Santa Rosa, California • Ruby B. Lee, Networked Systems Group, Cupertino, California • Swee
Kwang Lim, Asia Peripherals Division, Singapore • Alfred Maute, Waldbronn Analytical Division, Waldbronn, Germany • Andrew McLean, Enterprise
Messaging Operation, Pinewood, England • Dona L. Miller, Worldwide Customer Support Division, Mountain View, California • Mitchell Mlinar, HP-EEsof
Division, Westlake Village, California • Michael P. Moore, VXI Systems Division, Loveland, Colorado • Shelley I. Moore, San Diego Printer Division,
San Diego, California • M. Shahid Mujtaba, HP Laboratories, Palo Alto, California • Steven J. Narciso, VXI Systems Division, Loveland, Colorado • Danny
J. Oldfield, Electronic Measurements Division, Colorado Springs, Colorado • Garry Orsolini, Software Technology Division, Roseville, California • Ken
Poulton, HP Laboratories, Palo Alto, California • Günter Riebesell, Böblingen Instruments Division, Böblingen, Germany • Marc Sabatella, Software
Engineering Systems Division, Fort Collins, Colorado • Michael B. Saunders, Integrated Circuit Business Division, Corvallis, Oregon • Philip Stenton, HP
Laboratories Bristol, Bristol, England • Stephen R. Undy, Systems Technology Division, Fort Collins, Colorado • Jim Willits, Network and System
Management Division, Fort Collins, Colorado • Koichi Yanagawa, Kobe Instrument Division, Kobe, Japan • Dennis C. York, Corvallis Division, Corvallis,
Oregon • Barbara Zimmer, Corporate Engineering, Palo Alto, California



©Hewlett-Packard Company 1996 Printed in U.S.A. The Hewlett-Packard Journal is printed on recycled paper.




60   A 1.0625-Gbit/s Fibre Channel Chipset with Laser Driver, by Justin S. Chang, Richard Dugan,
     Benny W.H. Lai, and Margaret M. Nakamoto

68   Applying the Code Inspection Process to Hardware Descriptions, by Joseph J. Gilray

73   Overview of Code-Domain Power, Timing, and Phase Measurements, by Raymond A.
     Birgenheier



Departments 



5 In this Issue 

7 Cover 

7 What's Ahead 

94 Authors 



The Hewlett-Packard Journal is published bimonthly by the Hewlett-Packard Company to recognize technical contributions made by Hewlett-Packard
(HP) personnel. While the information found in this publication is believed to be accurate, the Hewlett-Packard Company disclaims all warranties of
merchantability and fitness for a particular purpose and all obligations and liabilities for damages, including but not limited to indirect, special, or
consequential damages, attorney's and expert's fees, and court costs, arising out of or in connection with this publication.

Subscriptions: The Hewlett-Packard Journal is distributed free of charge to HP research, design and manufacturing engineering personnel, as well as
to qualified non-HP individuals, libraries, and educational institutions. To receive an HP employee subscription you can send an email message indicating
your HP entity and mailstop to Idc_litpro@hp-paloalto-gen13.om.hp.com. Qualified non-HP individuals, libraries, and educational institutions in the U.S.
can request a subscription by either writing to: Distribution Manager, HP Journal, M/S 20BH, 3000 Hanover Street, Palo Alto, CA 94304, or sending an email
message to: hp_journal@hp-paloalto-gen13.om.hp.com. When submitting an address change, please send a copy of your old label to the address on the
back cover. International subscriptions can be requested by writing to the HP headquarters office in your country or to Distribution Manager, address
above. Free subscriptions may not be available in all countries.

The Hewlett-Packard Journal is available online via the World-Wide Web (WWW) and can be viewed and printed with Mosaic and a postscript viewer, or
downloaded to a hard drive and printed to a postscript printer. The uniform resource locator (URL) for the Hewlett-Packard Journal is http://www.hp.com/
hpj/Journal.html.

Submissions: Although articles in the Hewlett-Packard Journal are primarily authored by HP employees, articles from non-HP authors dealing with
HP-related research or solutions to technical problems made possible by using HP equipment are also considered for publication. Please contact the
Editor before submitting such articles. Also, the Hewlett-Packard Journal encourages technical discussions of the topics presented in recent articles
and may publish letters expected to be of interest to readers. Letters should be brief, and are subject to editing by HP.

Copyright © 1996 Hewlett-Packard Company. All rights reserved. Permission to copy without fee all or part of this publication is hereby granted provided
that 1) the copies are not made, used, displayed, or distributed for commercial advantage; 2) the Hewlett-Packard Company copyright notice and the title
of the publication and date appear on the copies; and 3) a notice appears stating that the copying is by permission of the Hewlett-Packard Company.

Please address inquiries, submissions, and requests to: Editor, Hewlett-Packard Journal, M/S 20BH, 3000 Hanover Street, Palo Alto, CA 94304 U.S.A.







In this Issue 

Symmetric multiprocessing means that a system can distribute its workload
evenly over multiple CPUs. Thus, coupled with high-performance memory and
I/O subsystems, a symmetric multiprocessing system is able to provide balanced
high performance. The article on page 8 begins a group of articles that describe
a new family of symmetric multiprocessing workstations and servers that have
both low cost and high performance and satisfy a wide range of customer
needs. First, there are the HP 9000 J-class high-end workstations and the HP
9000 K-class servers, which run the HP-UX* operating system, and then there
are the HP 3000 Series 9x9KS servers, which run the MPE/iX operating system.
The J-class workstation provides up to two-way symmetric multiprocessing, and
the K-class server provides up to four-way symmetric multiprocessing. These systems are based on the
superscalar PA-RISC processor called the PA 7200 (page 25) and a high-speed memory bus called the
Runway bus (page 18).

The Runway bus, which is the backbone of these J/K-class platforms, is a new processor-to-memory-and-I/O
interconnect that is ideally suited for one-way to four-way symmetric multiprocessing for
high-end workstations and midrange servers. The bus includes a synchronous, 64-bit, split-transaction,
time-multiplexed address and data bus that is capable of sustained memory bandwidths of up to 768
Mbytes per second in a four-way system.

One of the design goals for the Runway bus was to support the PA 7200 and future processors. The PA
7200 is an evolution of the high-performance, single-chip superscalar PA 7100 CPU design. The processor
and the Runway bus are designed for a bus frequency of 120 MHz in a four-way multiprocessor system,
which enables the sustained memory bandwidth of 768 Mbytes per second mentioned above. The PA
7200 contains all the circuits required for one processor in a multiprocessor system except for external
cache arrays. Among the features contained in the PA 7200 are a new data cache organization,
a prefetching mechanism, and two integer ALUs for general integer superscalar execution. The PA 7200
is described on page 25.

The increased functionality and higher operating frequency of today's VLSI chips have created a corre- 
sponding increase in the complexity of the verification process. In fact, chip verification activities now 
consume more time and resources than design. The article on page 34 describes the functional and 
electrical verification process used for the PA 7200 processor to ensure its quality and correctness. 
Since the design of the PA 7200 was based on the PA 7100 processor, verification could begin very early 
in the design because the same modeling language and simulator used for the PA 7100 could be used for 
the PA 7200. The article also describes debugging activities performed and the testability features pro- 
vided on the PA 7200. 

After investigating ways to improve customer application performance by observing existing platforms, 
the HP 9000 J/K-class design team determined that memory capacity, memory bandwidth, memory 
latency, and system-level parallelism (multiple CPUs and I/O devices all accessing memory in parallel) 
were key elements in achieving high performance. As the article on page 44 describes, a major improve- 
ment in memory bandwidth was achieved through system-level parallelism and memory interleaving, 
which were designed into the Runway bus and the J/K-class memory subsystem. 

Cache coherency refers to the consistency of data between processors (and associated caches), memory
modules, and I/O devices. For the HP 9000 J/K-class systems, a scheme called hardware cache coherent
I/O was introduced. This technique involves the I/O system hardware in ensuring cache coherency,
thereby reducing memory and processor overhead and contributing to greater system performance.
Cache coherent I/O is discussed in the article on page 52.

The articles on pages 60 and 68 are more papers from the proceedings of HP's 1995 Design Technology
Conference (DTC). The first article describes a 1.0625-Gbit/s Fibre Channel transmitter and receiver chipset.
About three years ago HP introduced the first commercially available, two-chip, 1.4-gigabit-per-second,
low-cost, serial data link interface, the HP HDMP G-link chipset. The new chipset, the HP HDMP-1512
(transmitter) and the HP HDMP-1514 (receiver), is a low-cost gigabit solution for Fibre Channel applications.
The chipset implements the Fibre Channel FC-0 physical layer specification at 1.0625 Gbits/s. The
transmitter features 20:1 data multiplexing with a comma character generator and a clock synthesis
phase-locked loop, and includes a laser driver and a fault monitor for safety. The receiver performs
clock recovery, 1:20 data demultiplexing, comma character detection, word alignment, and redundant
loss-of-signal alarms for eye safety.

The other DTC paper (page 68) discusses using the traditional software code inspection process for
inspecting hardware descriptions written in Verilog HDL (hardware description language). The code
inspection process for software development has been around for a while and has proven itself to be an
effective tool for finding design and code defects and sharing best practices among software engineers.
The authors found that, except for some issues specific to HDL, the format and results of their inspection
process were very similar to the standard software inspection process.

The Telecommunications Industry Association (TIA) has released two standards (IS-95 and IS-97) that
specify the various measurements required to ensure the compatibility of North American CDMA (code
division multiple access) cellular transmitters and receivers. CDMA, which is used by the cellular
telephone industry, is a class of modulation that uses specialized codes to provide multiple communication
channels in a designated segment of the electromagnetic spectrum. The article on page 73 provides a
tutorial overview of the operation of the algorithms in the HP 83203B CDMA cellular adapter, which is
designed to make the base station measurements specified in the TIA standards. The article also covers
the general concepts of CDMA signals and measurement and some typical measurements made with
the HP 83203B.

C.L. Leath 
Managing Editor 



Cover 

The HP 9000 J/K-class servers and workstations and the HP 3000 Series 9x9KS servers are
system-designed for high performance and low cost, meaning that all of the boards have design features that
optimize their functionality specifically for these systems. The cover is a group photograph of the boards;
individual descriptions and photos can be found in the article on page 8.



What's Ahead 

Articles planned for the April issue include eight articles on the Common Desktop Environment (CDE) for
systems based on the UNIX® operating system and articles on the PalmVue mobile patient data system,
the HP G1009A protein analyzer, and a power module for cellular telephones.






Symmetric Multiprocessing 
Workstations and Servers 
System-Designed for High
Performance and Low Cost 



A new family of workstations and servers provides enhanced system
performance in several price classes. The HP 9000 Series 700 J-class
workstations provide up to 2-way symmetric multiprocessing, while the
HP 9000 Series 800 K-class servers (technical servers, file servers) and
HP 3000 Series 9x9KS business-oriented systems provide up to 4-way
symmetric multiprocessing.

by Matt J. Harline, Brendan A. Voge, Loren P. Staley, and Badir M. Mousa



Blending high performance and low cost, a new family of
workstations and servers has been designed to help maintain
HP's leadership in system performance, price/performance,
system support, and system reliability. This article
and the accompanying articles in this issue describe the
design and implementation of the HP 9000 J-class workstations,
which are high-end workstations running the
HP-UX* operating system, the HP 9000 K-class servers,
which are a family of midrange technical and business servers
running the HP-UX operating system, and the HP 3000
Series 9x9KS servers, which are a family of midrange business
servers running the MPE/iX operating system. In this
issue, these systems will be referred to collectively as
J/K-class systems.

The goals of the design team were to achieve high performance
and low cost, while at the same time creating a
broad family of systems that would share many of the same
components and meet a wide range of customer needs. The
challenge was to create a list of requirements that would
meet the needs of the three different target markets: the
UNIX®-system-based workstation market, the UNIX-system-based
server market, and Hewlett-Packard's proprietary
MPE/iX-system-based server market. The basic requirements
for these systems were to deliver leadership symmetric
multiprocessing performance, memory performance, and
capacity, along with exceptional I/O performance. Balanced
system performance was the overall goal.

Hardware Features 

All of the J/K-class platforms are built around the same basic
building blocks (see Fig. 1). The backbone of these systems
is the high-speed processor-memory bus called the Runway
bus. This is a 640-to-768-Mbyte/s (peak sustained bandwidth),
64-bit-wide bus that connects the processors, system main
memory, and the I/O adapter (bus converter). The Runway
bus is described in more detail in the article on page 18. The
I/O adapter provides connections to two HP-HSC (Hewlett-Packard
high-speed system connect) buses, providing a raw
I/O bandwidth between 128 Mbytes/s and 160 Mbytes per
second (96 to 116 Mbytes/s peak sustained bandwidth). The
HP-HSC bus is an extension of the GSC (General System
Connect) bus used in earlier workstations.¹
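The bandwidth figures above can be related to bus width and clock rate with simple arithmetic. The following is a minimal sketch, not a model of the actual bus protocol. It assumes that the 640 and 768 Mbyte/s sustained figures correspond to the 100-MHz and 120-MHz Runway clock rates quoted elsewhere in this article, and that the 128 and 160 Mbyte/s raw HP-HSC figures correspond to the 32-MHz and 40-MHz slot clocks mentioned under I/O Expansion. The function and variable names are illustrative only.

```python
def raw_bandwidth_mbytes(width_bits: int, clock_mhz: float) -> float:
    """Raw transfer rate in Mbytes/s if every clock cycle carried data."""
    return width_bits / 8 * clock_mhz


# Runway bus: 64 bits wide, clocked at 100 or 120 MHz (figures from the article).
runway_raw = {mhz: raw_bandwidth_mbytes(64, mhz) for mhz in (100, 120)}

# Peak sustained figures quoted in the article; pairing them with the two
# clock rates is an assumption of this sketch.
runway_sustained = {100: 640, 120: 768}

for mhz, raw in runway_raw.items():
    fraction = runway_sustained[mhz] / raw
    print(f"Runway @ {mhz} MHz: raw {raw:.0f} Mbytes/s, "
          f"sustained {runway_sustained[mhz]} Mbytes/s ({fraction:.0%} of raw)")

# HP-HSC: dividing the quoted raw bandwidth by the assumed slot clock gives
# the number of bits that would have to move on each clock.
hsc_bits_per_clock = {mhz: raw * 8 // mhz for raw, mhz in ((128, 32), (160, 40))}
print(hsc_bits_per_clock)
```

Under these assumptions both Runway clock rates give the same ratio of sustained to raw bandwidth, and both HP-HSC figures imply the same transfer width per clock, which suggests the quoted numbers are mutually consistent.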

In addition to the Runway and HP-HSC buses, the J/K-class
systems also support a connectivity I/O bus. In the HP 9000
J-class workstation systems, the connectivity I/O bus is
EISA (Extended Industry Standard Architecture); it has a
peak bandwidth of 32 Mbytes/s. In the HP 9000 K-class and
HP 3000 Series 9x9KS server systems, the connectivity I/O
bus is the HP Precision Bus (HP-PB). The servers have one
or two four-slot HP-PB adapters. Each HP-PB has a peak
bandwidth of 32 Mbytes/s.

Processor 

The core of the J/K-class systems is a high-performance processor
module that interfaces directly to the Runway bus. It
is based on the HP PA 7200 CPU chip,² a PA-RISC superscalar
processor, which is an evolution of the high-performance,
single-chip, superscalar PA 7100 processor. The PA
7200 incorporates a high-speed Runway bus interface, a new
data cache organization with an on-chip assist cache, data
prefetching, and two integer ALUs. This microprocessor is
fabricated using HP's 0.55-micrometer CMOS process and
delivers reliable performance up to 120 MHz. More information
on the PA 7200 can be found in the article on page 25.
Fig. 2 is a photograph of the processor module.

The tables on page 10 indicate the processor speeds for
each of the platforms in the J/K-class family. Table I is for
the HP-UX workstation systems, Table II is for the HP-UX
symmetric multiprocessing servers, and Table III is for the
MPE/iX symmetric multiprocessing servers.

System Board

Central to the J/K-class systems is the system circuit board
(Fig. 3). This printed circuit board contains all the circuitry
required for implementing the Runway bus and connectors





Fig. 1. Block diagram of the HP 9000 Model K400 server.
(Legend: I = Instruction, D = Data, MMC = Master Memory Controller,
SMC = Slave Memory Controller, DM = Data Multiplexer, IOA = I/O Adapter,
PDH = Processor Dependent Hardware, PDC = Processor Dependent Code,
HSC = High-Speed System Connect.)



Fig. 2. Processor module. (Labels: PA 7200 CPU with heat sink, cache RAMs.)



for the processors, the master memory controller, and the
I/O adapter. The bootstrap code and other system-specific
hardware are also on the system board. For the entire family
of J/K-class systems, there are only three system board designs:
one for the 1-way or 2-way symmetric multiprocessing
workstation configuration (J class), one for the uniprocessor
server configuration (K100), and one for the 1-way to
4-way symmetric multiprocessing server systems (K2x0,
K4x0).

In the workstation systems, the system board includes the
Runway bus and system dependent hardware mentioned
above, the complete memory system including the connectors
for the memory modules (SIMMs), most of the circuitry
required for the system's built-in I/O functionality (core I/O),
and power supply management and control circuits. Five I/O
slots are provided for system I/O expansion. These five slots
are shared; a combination of EISA and HP-HSC cards can be
installed, with a maximum of four EISA cards or three
HP-HSC cards. For example, a system could have four EISA





Table I
HP 9000 J-Class Processor Speeds

Model    Processor Slots    Processor Speed
J200     2                  100 MHz
J210     2                  120 MHz

Table II
HP 9000 K-Class Processor Speeds

Model    Processor Slots    Processor Speed
K100     1                  100 MHz
K200     2                  100 MHz
K210     4                  120 MHz
K400     4                  100 MHz
K410     4                  120 MHz

Table III
HP 3000 Series 9x9KS Processor Speeds

Series   Processor Slots    Processor Speed
939KS    1                  80 MHz†
959KS    4                  100 MHz
969KS    4                  120 MHz

† Effective processor speed

cards and one HP-HSC card, or three EISA and two HP-HSC
cards, or two EISA and three HP-HSC cards.
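The shared-slot rule can be checked by enumeration. The sketch below assumes that the only constraints are the three stated in the text (five slots in total, at most four EISA cards, at most three HP-HSC cards). It does not model which physical slot accepts which card type, since that is not described here. The names are illustrative.

```python
TOTAL_SLOTS = 5   # shared I/O expansion slots in the workstation
MAX_EISA = 4      # at most four EISA cards
MAX_HSC = 3       # at most three HP-HSC cards


def valid_configurations(fill_all_slots: bool = False):
    """Return (eisa, hsc) card counts that satisfy the stated slot limits."""
    configs = []
    for eisa in range(MAX_EISA + 1):
        for hsc in range(MAX_HSC + 1):
            used = eisa + hsc
            if used > TOTAL_SLOTS:
                continue  # more cards than slots
            if fill_all_slots and used != TOTAL_SLOTS:
                continue  # caller only wants fully populated systems
            configs.append((eisa, hsc))
    return configs


full = valid_configurations(fill_all_slots=True)
print(full)
```

With all five slots populated, the enumeration yields exactly the three combinations listed in the text.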

The symmetric multiprocessing server system board includes
the Runway bus and system dependent hardware as
described above, along with slots for a separate core I/O
card, an optional expansion HP-HSC I/O carrier card, one or
two 1G-byte memory carriers, and four or eight HP-PB slots.
Four Runway slots are provided for the processor modules.
Depending on the processor used in the system, the Runway
bus operates at 100 MHz or 120 MHz.

The uniprocessor system board (HP 9000 Model K100) has a
single processor, all memory controllers and SIMM slots, a
core I/O card slot, and four HP-PB slots.

System Firmware 

All of the J/K-class systems share a common firmware base
that tests and initializes the system on powerup. This code
is a combination of PA-RISC assembly code and C. It was a
design goal to support all of the server products using the
same firmware and to have a common firmware base for the
technical workstation products. The code was designed in a
very modular fashion so that the code base could be easily
ported to the various system platforms.

The system firmware is designed to be very robust. For example,
during memory configuration and test, it uses a combination
of bank and page deallocation to deconfigure memory
containing hard errors, allowing the user to continue
using the system until the failing memory can be replaced.
Similarly, processors that fail self-test are deconfigured and
the system boot process is continued.
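The article does not describe how the firmware chooses between page and bank deallocation. Purely as an illustration of the idea, the sketch below drops individual pages for isolated hard errors and drops a whole bank once too many of its pages have failed. Everything in it is hypothetical: the bank size, the threshold, the assumption that a bank covers a contiguous address range, and the function name. It is not the J/K-class firmware algorithm.

```python
PAGE_SIZE = 4096                # bytes; assumed page size for this sketch
BANK_SIZE = 16 * 1024 * 1024    # hypothetical bank size in bytes
BANK_THRESHOLD = 8              # hypothetical: bad pages tolerated per bank


def deconfigure(bad_addresses):
    """Return (banks, pages) to remove from the usable memory map."""
    bad_pages_by_bank = {}
    for addr in bad_addresses:
        bank = addr // BANK_SIZE
        page = addr // PAGE_SIZE
        bad_pages_by_bank.setdefault(bank, set()).add(page)

    dead_banks, dead_pages = set(), set()
    for bank, pages in bad_pages_by_bank.items():
        if len(pages) > BANK_THRESHOLD:
            dead_banks.add(bank)    # widespread failure: drop the whole bank
        else:
            dead_pages |= pages     # isolated failures: drop only those pages
    return dead_banks, dead_pages


# Two errors in the same page of bank 0, plus nine failing pages in bank 1.
errors = [0x1000, 0x1004] + [BANK_SIZE + i * PAGE_SIZE for i in range(9)]
banks, pages = deconfigure(errors)
print(banks, pages)
```

The point of the two-level scheme is that a single failing location costs only one page of capacity, while a badly damaged module is removed as a unit so the system can still boot on the remaining memory.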

In addition to providing a robust system to the customer, the
system firmware allows designers and the manufacturing
processes easy access to system test and configuration of
hardware and firmware features. Some of these features
allow enabling or disabling of processor cache prefetching,
full memory test or memory initialization only, and so on.
This helped in the system debug effort by speeding the boot
process and making it possible to disable certain functions
while searching for the root cause of a bug in the system.

Another feature built into the system firmware during the
system debug process was a debug interface that would
allow the lab engineers to set soft breakpoints and step
through instruction execution one instruction at a time. This
tool proved to be quite valuable, providing increased visibility
into system behavior and the system state.

The J/K-class firmware is installed in flash EPROM. The
firmware can be updated through the system offline diagnostic
environment. If for any reason the system firmware
needs to be modified, it can easily be upgraded by loading a
new firmware image from tape or another medium into
system memory and then loading it into the firmware flash
EPROM.

The result of these design choices is system firmware that
provides flexible functionality, reliable system test and
initialization, and some tolerance for certain types of failed
components in the system boot process.

High-Performance Memory

Memory performance was highly important throughout the
J/K-class system design and implementation. The J/K-class
memory subsystem is designed with consideration for high
bandwidth, low latency, and expandability from 32M bytes


Fig. 3. System board. (Labels: dual I/O adapter, master memory
controller, HP-PB I/O slots, expansion HP-HSC I/O.)






to 2G bytes. It is capable of interleaving memory accesses
across 32 banks of memory. The memory system is built
around the master memory controller (MMC), which interfaces
to the high-speed Runway bus. The MMC communicates
with up to eight slave memory controllers (SMC) on
one or two memory carriers (see Fig. 4 and the article on
page 44). Also on the memory carriers are data multiplexers
and pairs of SIMMs (single-inline memory modules). This
design results in a high-bandwidth, interleaved, 2G-byte
memory subsystem. As 64M-bit DRAMs become cost-effective,
the 2G-byte limit will increase to 3.75G bytes of
main memory. Table IV shows the maximum amounts of
memory available in various J/K-class systems.
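Interleaving spreads consecutive memory accesses across banks so that several banks can be busy at once. The sketch below shows the general technique only. The 32-bank count comes from the text, but the 32-byte access granularity and the simple modulo mapping are assumptions of this sketch; the actual J/K-class mapping is covered in the memory system article on page 44.

```python
NUM_BANKS = 32     # interleave width, from the article
LINE_BYTES = 32    # assumed bytes per memory access; not stated in this section


def bank_of(address: int) -> int:
    """Map a physical address to a bank using low-order line-address bits."""
    return (address // LINE_BYTES) % NUM_BANKS


# Walk 40 consecutive lines and record which bank each one lands in.
banks = [bank_of(a) for a in range(0, 40 * LINE_BYTES, LINE_BYTES)]
print(banks)
```

With this kind of mapping, a sequential stream touches every bank once before returning to the first, which gives each bank time to recover between accesses.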



Table IV
Maximum Memory

System Type    Model or Series    Maximum Memory
HP 9000        Model J2x0         1024M bytes
HP 9000        Model K100         512M bytes
HP 9000        Model K2x0         1024M bytes
HP 9000        Model K4x0         2048M bytes
HP 3000        Series 939KS       !856M bytes
HP 3000        Series 959KS       2048M bytes
HP 3000        Series 969KS       2048M bytes

Supporting a high-density and high-performance memory
system with industry-standard memory SIMMs would have
resulted in a costly memory system that would not have
performed at the desired levels. Instead of the industry-standard
approach, a denser memory module was designed.
These memory modules (Fig. 5) are actually a dual-inline
design, although they are still referred to as SIMMs. In the
J/K-class systems, these dual SIMMs are inserted in pairs,
providing two separately addressable, 128-bit, ECC (error
correcting code) protected banks of memory (144 bits
including ECC check bits). Each dual SIMM provides 72 bits
of the two 144-bit banks. Using 4M-bit or 16M-bit DRAMs,
the SIMMs are available in 16M-byte and 64M-byte sizes.
While these memory modules are not standard, there is no
HP proprietary technology in them, helping to keep memory
pricing very competitive with the industry.
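The bank organization can be cross-checked against the capacity and interleave figures quoted earlier. The sketch below assumes a system populated entirely with SIMMs of one size and treats the quoted SIMM sizes as data capacity; the helper name is illustrative.

```python
DATA_BITS = 128          # data bits per bank
BANK_BITS = 144          # data plus ECC check bits per bank
BITS_PER_SIMM = 72       # each dual SIMM supplies 72 bits of each bank
BANKS_PER_PAIR = 2       # a SIMM pair forms two separately addressable banks

check_bits = BANK_BITS - DATA_BITS             # ECC overhead per bank
simms_per_pair = BANK_BITS // BITS_PER_SIMM    # SIMMs needed to span a bank


def banks_for(total_mbytes: int, simm_mbytes: int) -> int:
    """Number of banks in a system built only from SIMMs of one size."""
    simms = total_mbytes // simm_mbytes
    pairs = simms // simms_per_pair
    return pairs * BANKS_PER_PAIR


print(check_bits, simms_per_pair)
print(banks_for(2048, 64))   # maximum K4x0 configuration with 64M-byte SIMMs
```

Under these assumptions, a 2048M-byte system built from 64M-byte SIMMs contains 32 SIMMs, or 16 pairs, which matches the 32 banks of interleaving mentioned above.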

I/O Adapter 

The J/K-class I/O adapter (bus converter) interfaces between
the Runway bus and the HP-HSC I/O bus. The I/O requirements
for a J/K-class system call for multiple I/O buses, so
the I/O adapter package contains two fully independent bus



Fig. 4. Memory carrier board. (Labels: slave memory controllers,
data MUX, SIMM slots.)

Fig. 5. Dual-inline memory module.

converters (see Fig. 6a). To maximize system flexibility, the
I/O adapter is designed to support a range of bus frequencies
on either bus, thus requiring a full synchronizer. Fig. 6b
is a block diagram of the I/O adapter.

The HP-HSC bus only has a 32-bit address space, while the
Runway bus supports a 40-bit address space. This requires
an address translation mechanism to provide the additional
eight address bits. The processor's aggressive data prefetching
requires a new mechanism for DMA (direct memory
access) to coexist with this processor feature. Hardware
cache coherent I/O solves these two problems (see article,
page 52). Prefetching is also included in the HP-HSC bus to
reduce memory read latency and increase I/O bandwidth. All
of these features required additional hardware support in
the I/O adapter.
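One common way to supply the missing address bits is a page-granular lookup: the upper bits of the 32-bit I/O address select a table entry that holds a wider physical page number. The sketch below illustrates that general idea only. The 4K-byte page size, the dictionary standing in for a translation table, and all names are assumptions of this sketch; the actual mechanism is described in the article on page 52.

```python
PAGE_SHIFT = 12        # 4K-byte pages, assumed for this sketch
IO_ADDR_BITS = 32      # HP-HSC address width, from the article
PHYS_ADDR_BITS = 40    # Runway address width, from the article


def translate(io_addr: int, io_table: dict) -> int:
    """Translate a 32-bit I/O bus address to a 40-bit Runway address."""
    if not 0 <= io_addr < (1 << IO_ADDR_BITS):
        raise ValueError("I/O address does not fit in 32 bits")
    io_page = io_addr >> PAGE_SHIFT
    offset = io_addr & ((1 << PAGE_SHIFT) - 1)
    phys_page = io_table[io_page]    # KeyError models a missing translation
    phys_addr = (phys_page << PAGE_SHIFT) | offset
    if phys_addr >= (1 << PHYS_ADDR_BITS):
        raise ValueError("translated address exceeds 40 bits")
    return phys_addr


# Hypothetical table with one entry: I/O page 0x12345 maps to page 0xAB12345.
io_table = {0x12345: 0xAB12345}
result = translate(0x12345678, io_table)
print(hex(result))
```

The page offset passes through unchanged, so only the page number needs to widen, from 20 bits on the I/O side to 28 bits on the Runway side under the assumed page size.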

According to the PA-RISC architecture definition, a bus converter
also needs to provide the registers to configure address
space, enable and disable features, log errors, manipulate
the TLB, and provide diagnostic access. Therefore, these
registers are included in the I/O adapter.

The J/K-class systems required several other hardware features
that by default were put into the I/O adapter. Among
these is the hardware to interface to external components
implementing the processor dependent hardware (PDH)
necessary to provide boot firmware, stable storage for
system configuration information and error logging, and
scratch RAM. The I/O adapter also provides a real-time
clock for keeping track of time when power is off.

Basic VLSI support of scan-based testing, both internal and
boundary (JTAG or IEEE 1149.1), is built into the I/O adapter,
along with double-strobe capability for speed path testing
and built-in self-test (BIST) for the RAM structures.

Finally, it was desired that the design be done in a modular
fashion, enabling future designs to easily borrow portions of
the design for future enhancements or to lower costs. This
required that the chip be designed with well-defined and
simple interfaces. The synchronizers made very natural
places to define the boundaries of these modules. All of
these requirements led to a modular, synchronizer
queue-coupled, hardware cache coherent, dual bus converter
design.

Core I/O Functionality

The basic I/O requirements for both the workstation and the
server systems include 20-Mbyte/s fast-wide SCSI (Small
Computer System Interface) for system disk connectivity,
5-Mbyte/s single-ended SCSI for archival storage, and an
IEEE 802.3 LAN interface for networking. The HP-UX






Runway Sifs. 



Processor 

Dependent 

Hardware Pah 



m 




HF-HSCBus 
PQrt2 



Runway Buffers: 



r 



1 



1 



flunway Receive 
Registef 



I Pod jOulstairdifig 

\ ReaiJ Transact ions 

and Freletcli 

Data) 



Runw&v Drive 
Register 




HP-HSC 



Fig* b. ia) Tl^ere are uvo tully independent 1/0 adapters In the I/O 
adapter pac kage. Qy) VO adapter bJock diagjani. 

systems also iiicliidG a Biti'oiiics paiallel iiuerface port, 
keyl:>oaitl ai\d mouse coimectioiis. and serial 1/0 port^ as 
pait of the cove 1/0 functionality. A photograph of the K- 
ciass core I/O l>oard is shown in Fig, 7, Tlie wajkstation 
model adds high-quality audio input and output to the 
built-in core I/O. 

The server system includes additional serial ports and an
integrated modem for remote service access. The server
systems also have a remote service console access port to
allow remote servicing of hardware and software by
Hewlett-Packard's customer support organizations.



I/O Expansion 

On the HP 9000 K-class server systems, several configura-
tions support various system I/O needs (see Table V). As a
minimum, the system comes with one 32-MHz HP-HSC bus
slot for expansion I/O. This slot is in a compact 3-by-5-inch
form factor. As I/O needs increase, the system can be up-
graded to provide four 40-MHz HP-HSC slots in addition to
the one 32-MHz HP-HSC slot (Model K4x0 only). In addition
to the HP-HSC slots, the K-class server has four or eight
Hewlett-Packard Precision Bus (HP-PB) slots. These slots
are configured such that the user can install up to four
double-high HP-PB cards and still have four single-high
HP-PB card slots available.





Table V
HP 9000 K-Class I/O Expansion Capabilities

Model   HP-HSC Bus Slots   HP-PB Slots   Peak Sustained I/O Bandwidth†
K100    1                  4             95 Mbytes/s
K200    1                  4             211 Mbytes/s
K210    1                  4             211 Mbytes/s
K400    5                  8             211 Mbytes/s
K410    5                  8             211 Mbytes/s

† Combined bandwidth of the two HP-HSC buses.

In the HP 9000 J-class workstation configurations, the system
supports an 8-MHz EISA bus (maximum of four slots) and a
40-MHz HP-HSC expansion I/O bus (maximum of three slots).
These slots provide the workstation user with a great deal of
flexibility in configuring I/O devices and meeting high-speed
I/O requirements (see Table VI).

Table VI
HP 9000 Model J2x0 I/O Expansion Capabilities

I/O Slot   Configuration
Slot 0     HP-HSC
Slot 1     HP-HSC or EISA
Slot 2     HP-HSC or EISA
Slot 3     EISA
Slot 4     EISA



Fig. 7. K-class core I/O board.



A number of I/O cards are currently available for use in the
high-speed I/O and HP-PB bus slots. A number of EISA cards
are supported in the workstation system. The following is a
partial list of the I/O cards available for J/K-class systems:

• 5-Mbyte/s single-ended SCSI
• 20-Mbyte/s fast-wide SCSI
• HP Fiber Link
• IEEE 802.3 LAN
• IEEE 802.5 token ring
• FDDI
• Fibre Channel
• ATM (asynchronous transfer mode)
• Programmable serial interface
• Bitronics parallel port
• 16-port serial RS-232
• 32-port serial RS-232
• 2D graphics card
• 3D graphics card.

Integrated Peripherals 

The server systems all have a standard DDS tape drive and a
CD-ROM drive integrated into the system. In addition, there
is space available for up to four 20-Mbyte/s SCSI disk drives
in the system box. The workstation comes with a standard
3.5-inch flexible disk drive, a CD-ROM drive, a tape drive,
and two slots for 20-Mbyte/s SCSI disk drives built into the
system box.

Industrial Design 

The J/K-class industrial design is intended to convey a strong
perception of the power within, wrapped in bold, distinctive
designs that are elegant and pleasing to the eye. The K-class
product is designed to work as a floor-standing product as
well as rack-mounted in an industry-standard HP 19-inch
EIA rack (Fig. 8). A growing number of rack-mounted HP
peripheral products such as disk arrays, uninterruptible
power supplies, and LAN hubs complement the overall sys-
tem. The J-class system (Fig. 9) is designed for floor-standing
use in the commercial workstation environment, but can be
rack-mounted on a custom basis.

These machines were designed with ease of assembly and
serviceability as high priorities. They use plastic parts that
snap together over a riveted steel chassis without a single
screw or fastener, making assembly and disassembly very
quick and easy for service and for the eventual recycling at
the end of the products' life.

Customer ease of use was another design priority. This is
evident in the brightly backlit liquid crystal display, which
conveys system status information in a clear text font, a vast
improvement over previous systems, which had flashing
LEDs. A simple three-position keyswitch for on, off, and
service mode is clearly marked and positioned within easy
reach on the front of the K-class system. The front door
gives the user easy access to peripherals and visual feedback
in the form of disk activity lights. Inside the front door are a
pocket for the user manual, a safe storage location for the
system key, and a system label with the most pertinent user
information.

Extensive effort went into label design, working with field
support engineers, to make these products the leaders in
their class in ease of installation, serviceability, and field up-
gradability. The labels use color coding and detailed diagrams
clearly defining such things as board locations, memory
SIMM loading sequences, and disk locations. These have
been very successful in making the many configurations
easily understood by customers and HP manufacturing and
field service personnel.

Server Package Design

Like every other aspect of the design of the J/K-class sys-
tems, designing the chassis and plastics proved to be chal-
lenging. With a strong emphasis on development schedule
and a desire for a very robust and flexible design, the engi-
neering team had to create some innovative solutions to
keep on schedule and keep the cost of the product low.

Several requirements defined the maximum height and
width of the server box. It had to fit into a 19-inch rack, so it
could be no wider than 17.7 inches and no deeper than 25
inches. It could be no taller than 25.24 inches, so that as a
standalone unit it would fit under a standard table.

Fig. 8. K-class server configurations.

Fig. 9. J-class workstation system processing unit.

An additional challenge was that of the Runway bus. The
expected high speeds of the bus required that the bus length
be kept to a minimum to reduce signal propagation delays.
At the same time, up to six different components needed to
attach to this bus: four processors, the master memory con-
troller (MMC), and the I/O adapter. The processor module
spacing was kept to 1.2 inches, allowing the overall length of
the Runway bus to be short enough to support reliable
120-MHz operation.
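
As a rough illustration of why bus length matters at this frequency, the
sketch below computes the 120-MHz clock period and the share of it that
signal flight time would consume. The propagation delay per inch and the
equal spacing of all six modules are assumptions made for the example,
not figures from the design.

```python
# Rough timing-budget sketch for a short synchronous bus.
# Values marked ASSUMED are illustrative and do not come from the article.

BUS_FREQUENCY_HZ = 120e6        # Runway bus frequency (from the article)
MODULE_SPACING_IN = 1.2         # processor module spacing (from the article)
NUM_MODULES = 6                 # four processors, MMC, I/O adapter
ASSUMED_DELAY_NS_PER_IN = 0.18  # ASSUMED propagation delay of a loaded trace


def clock_period_ns(frequency_hz):
    """Length of one bus clock cycle in nanoseconds."""
    return 1e9 / frequency_hz


def flight_time_ns(num_modules, spacing_in, delay_ns_per_in):
    """End-to-end propagation time, ASSUMING equal spacing between modules."""
    bus_length_in = (num_modules - 1) * spacing_in
    return bus_length_in * delay_ns_per_in


period = clock_period_ns(BUS_FREQUENCY_HZ)
flight = flight_time_ns(NUM_MODULES, MODULE_SPACING_IN, ASSUMED_DELAY_NS_PER_IN)
print(f"clock period: {period:.2f} ns")
print(f"one-way flight time: {flight:.2f} ns ({flight / period:.0%} of a cycle)")
```

Under these assumptions the flight time is a noticeable fraction of the
cycle, which is consistent with the emphasis on keeping module spacing
small.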

This small size also presented an additional challenge, that
of cooling the many components in the system. An intensive
effort was launched to simulate and create mockups of the
proposed mechanical designs for airflow and expected inter-
nal system temperature rises. A number of cooling alterna-
tives were proposed and evaluated. The resulting solution
provides an air-cooled system with excellent airflow and a
remarkably quiet system for the power dissipated within the
box. There are only two fans in the entire system.

Other desired features helped to define how the system was
partitioned into the various circuit boards. Each board
needed to be easy to access. Almost every component in the
system can be accessed and removed from the system for
maintenance or repair in a matter of minutes. The size and
type of add-on I/O cards also required some creative design
to allow flexibility in the design as well as flexibility for the
customer.



Power System

The power system for the J/K-class systems can be split into
three subsystems: the system power supply, the system
power monitor, and the uninterruptible power supply (UPS).
The power features implemented in the workstation and
server systems are slightly different but follow the same
general philosophy: provide reliable power to the system at
a low cost.

The server system power supply provides the voltage rails
for the components in the system. This 925-watt supply in-
corporates power factor correction and accepts a wide
range of input voltages and input frequencies between 50
and 60 Hz. It provides a carryover time of 20 milliseconds
after a power failure.

The system power supply does not include any intelligence
to control the system turn-on or reset activities. This intelli-
gence is provided by the system power monitor (Fig. 10).
This circuit monitors the various aspects of the computer
system and the power supply output to determine if the
power should be turned on or off. This includes monitoring
the system internal temperature, checking the voltage out-
put to ensure that there is no undervoltage or overvoltage
condition, and providing diagnostic messages on the system's
liquid crystal display when problems occur.

Fig. 10. System power monitor.

The uninterruptible power supply (UPS) is an optional com-
ponent of the J/K-class systems. It provides additional assur-
ance of system availability and data integrity even if the ac
power lines fail for any reason. Upon a powerfail event, the
UPS provides ac power to the system for up to 15 minutes,
allowing the system to continue operation, or in the case of
an extended power outage, to shut down gracefully and save
critical data to disk. When ac power returns, the system will
continue operation, or if it was shut down, it can be restarted
without loss of data.

More details on the power supply, monitor, and UPS can be
found on page 16.

System Performance

The J/K-class systems were developed to provide customers
with excellent performance in the intended markets: mid-
range servers and high-end workstations. Our goal was not
necessarily to provide the highest single-component perfor-
mance, but to provide customer-valued application perfor-
mance at an extremely attractive price.



The most common component benchmark is the SPEC
(Systems Performance Evaluation Cooperative) suite, which
measures CPU integer and floating-point performance. For
these benchmarks, the processor in the J/K-class products
provides about 168 SPECint92 integer and 258 SPECfp92
floating-point performance at 120 MHz. With the well-
balanced symmetric multiprocessing J/K-class systems, the
SPECrate_int92 performance of a four-processor 120-MHz
system is 12,150 and the SPECrate_fp92 performance is
19,600.

It is in the real-world applications and at the system level
where the J/K-class computer systems really start to shine.
The balanced design of the Runway processor-memory bus,
the memory subsystem, and the performance I/O system
provides the user with exceptional performance. Two widely
used benchmarks that try to measure the performance of
realistic customer workloads are SPEC SFS1.0 (LADDIS)
and TPC-C. The 120-MHz LADDIS performance of the J/K-
class is as high as 4750 I/O operations per second, which
exceeds many high-end servers that typically have twice as
many processors (8 to 10 processors compared to four for a
Model K400). A four-way J/K-class server at 120 MHz has
demonstrated in excess of 4000 transactions per minute on
the TPC-C transaction benchmark. At the time of introduc-
tion of the J/K-class systems, the only other single system
with higher performance was HP's own T500 corporate
business server.

On the technical side, workstation applications clearly bene-
fit from the increased memory bandwidth. At the same time,
the introduction of multiprocessing in a high-end client
configuration provides the opportunity for either parallel
processing of a single task or more parallel execution for
multiple tasks. With the addition of the new high-end HP
Visualize-48 graphics, for which the J-class systems provide
some specific hardware performance enhancers, the work-
station products will handle large, complex design and
visualization problems easily.

Design for Lasting Value 

The J/K-class systems were designed to provide HP's custom-
ers with lasting value. Processors can be easily added to the
system, to a maximum of two processors in the workstation
systems and up to four processors in the server systems.
Upgrading from 100-MHz to 120-MHz processors is just as
simple. The J/K-class systems are also designed to accept
future processors easily, such as the PA 8000 processor,[3]
through a simple processor module upgrade.

Not only are processors easy to upgrade, but memory and
I/O are also designed so that it is easy to add memory and
I/O functionality. Memory can be added in 32M-byte or
128M-byte increments up to 1024M bytes in a J-class system
or up to 2048M bytes in a K-class or Series 9x9 KS system. As
increased-density DRAMs become cost-effective, memory
limits will increase to 3.75G bytes of main memory, filling
most users' memory configuration and capacity requirements
far into the future.

System Verification 

The design of any computer system requires an extensive
test and verification effort. For the chips and boards de-
signed for the J/K-class platforms, many engineer-months
were dedicated to ensuring the systems manufactured and
shipped to HP's customers are of the highest quality and
reliability. This testing can be grouped into several different
categories: presilicon chip and system simulation, formal
verification methods, system functional verification, chip and
system electrical characterization, and system validation.

Simulation. Before committing any part of the J/K-class de-
sign to silicon, extensive simulation had already proven the
basic functionality of each component individually and as
part of the system. Each component design team provided a
model of their particular part of the design to an overall sys-
tem simulation team. The system simulation team then pulled
together tools first to simulate subsystems and eventually to
simulate the entire J/K-class system. In addition to the logical
simulation to verify correct functionality, electrical simula-
tion was done for the critical portions of the system such as
clock distribution, system buses, and chip internal critical
paths (see article, page 34).

Formal Methods. For some parts of any design, it is very diffi-
cult to verify complete adherence to design specifications.
One area of concern in the J/K-class design was the bus pro-
tocols for the Runway bus. In an effort to reduce risk and
improve system reliability, formal methods[4] were used to
analyze the bus transaction protocols used in the Runway
bus definition. The analysis pointed to several defects, which
were corrected before implementation of the system.

Functional Verification. As the first components became avail-
able to the design teams for initial debugging, efforts were
focused on verifying that each component functioned prop-
erly in the system. The first goal was to boot the system to
the initial system loader. At this point either the operating
system (HP-UX) could be loaded or system and component
diagnostics could be loaded. While booting the operating
system is a great accomplishment, the task of verifying cor-
rect functionality was far from completed when this was
done.

Numerous tests were developed specifically for the J/K-class
systems. These tests employed a number of techniques for
finding defective components and defects in design. These
techniques included pseudorandom and pseudoexhaustive
code and data sequences that stressed the processors (inte-
ger units, floating-point units, caches, program control, etc.),
the memory and memory controllers, and the I/O bus adapters
and I/O controller cards.
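
To give a flavor of a pseudorandom data test, the following is a minimal
sketch and not the actual HP test code. It writes seeded random words to
a simulated memory in a shuffled order, reads them back, and then repeats
with the complemented pattern so that every bit is exercised as both 0
and 1. The SimulatedMemory class, its stuck-at fault model, and all names
are hypothetical.

```python
import random

WORD_MASK = 0xFFFFFFFF  # 32-bit words, chosen for the example


class SimulatedMemory:
    """Hypothetical memory model with an optional stuck-at-0 bit."""

    def __init__(self, size_words, stuck_addr=None, stuck_bit=0):
        self.words = [0] * size_words
        self.stuck_addr = stuck_addr
        self.keep_mask = WORD_MASK & ~(1 << stuck_bit)

    def write(self, addr, value):
        if addr == self.stuck_addr:
            value &= self.keep_mask  # the faulty bit can never be set
        self.words[addr] = value

    def read(self, addr):
        return self.words[addr]


def pseudorandom_test(mem, seed, passes=2):
    """Return a list of (addr, expected, actual) mismatches."""
    failures = []
    size = len(mem.words)
    for p in range(passes):
        rng = random.Random(seed + p)  # reproducible sequence per pass
        order = list(range(size))
        rng.shuffle(order)             # pseudorandom address order
        pattern = {addr: rng.getrandbits(32) for addr in order}
        for flip in (0, WORD_MASK):    # true pattern, then its complement
            for addr in order:
                mem.write(addr, pattern[addr] ^ flip)
            for addr in order:
                expected = pattern[addr] ^ flip
                actual = mem.read(addr)
                if actual != expected:
                    failures.append((addr, expected, actual))
    return failures


good = SimulatedMemory(256)
bad = SimulatedMemory(256, stuck_addr=17, stuck_bit=3)
print("good memory failures:", len(pseudorandom_test(good, seed=1996)))
print("bad memory failing addresses:",
      sorted({addr for addr, _, _ in pseudorandom_test(bad, seed=1996)}))
```

Because the seed fixes the sequence, any failure can be replayed exactly,
which is one reason pseudorandom rather than truly random stimulus is
attractive for hardware debugging.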

Electrical Characterization. Once a minimal level of system
functionality was attained, several electrical characteriza-
tion efforts were launched to prove that the components
and the system would function in the electrical environ-
ment. This testing focused on measuring electrical noise on
chips as well as boards, and looking at bus cross talk and
power supply variation and noise. Systems were stressed
beyond normal temperature ranges, voltage ranges, and fre-
quency ranges to find the weakest link in the system electri-
cal environment. Through all this characterization effort,
designs were modified and improved, resulting in a system
that is capable of running reliably throughout the specified
system operating environment.

System Validation. Because of the desire to stress the system
beyond what HP's customers will do, the functional and



K-Class Power System

The power system in the HP 9000 K-class servers uses a number of new and
emerging technologies to achieve excellent platform performance without com-
promising cost, reliability, and quality metrics. Combined in the power system are
the system power monitor, the system power supply, and an optional uninterrupt-
ible power supply (UPS). Key contributions of the system power monitor include
system turn-on and initialization including error reporting via a front-panel LCD
display, temperature monitoring and cooling, fan speed control based on ambient
temperature, fan synchronization and fault detection, continuous power supply
output voltage monitoring, special manufacturing modes of operation, overtem-
perature detection and warning, overtemperature shutdown, and other features.
The system power supply uses power factor correction to achieve low power-line
distortion while maximizing the available VA capacity of the input ac circuit. A
standard dc-to-dc forward converter follows the regulated power factor corrected
output. Remote sensing is used on all output rails to achieve tight regulation
specifications. The power system is optimized for use with several HP UPSs
employing both offline and online technologies. The UPSs use an autoranging
technology allowing worldwide use. Worldwide regulatory and safety approvals
apply to these UPSs. The hardware provides power-line filtering and conditioning
while the firmware provides many useful status and control capabilities, both
real-time and programmed for later execution.



Fig. 1. System power monitor block diagram.



System Power Monitor 

The system power monitor (Fig. 1) is where the power system gets its HP person-
ality. It was intended that most if not all nonstandard features of the power sys-
tem would be concentrated in this assembly, as opposed to having them in the
power supply itself. The power monitor is designed around a microprocessor, so
that most of its features are determined by firmware. This made it convenient to
modify these features as required during the system development phase, without
changing hardware. The power monitor is powered by a dedicated +15Vdc supply
which is turned on at all times if the system has ac power. The functions of the
power monitor are:

• Check the CPU modules in the system to see that they are all compatible with
  each other.
• Check the power supply in the system to see that it is compatible with the CPUs
  present.
• Respond to system keyswitch position and turn power supply on and off as
  required.
• Monitor all power supply output voltages for valid range.
• Monitor ambient temperature and initiate operator warnings.
• Control fan speed as a function of ambient temperature.
• Synchronize the two fans to avoid acoustic beating.
• Check for fan failure.
• Monitor internal system temperature for valid range.
• Initiate system reset signals.
• Issue ac powerfail warning signal.
• In case of any system malfunction, shut down the system and write a message to
  the front-panel liquid crystal display indicating why the system was shut down.

The notable contributions of the power monitor are its fan control scheme, which
makes the system remarkably quiet for its power level, and its contribution to
system maintainability through diagnostic display messages.
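
A few of these monitor functions can be sketched in software terms as
follows. This is only an illustration of the kind of decision logic
described, not the HP firmware: every threshold, the linear fan ramp, and
the message strings are assumptions invented for the example.

```python
from dataclasses import dataclass

# All thresholds below are ASSUMED for illustration; the article gives none.
FAN_MIN_DUTY = 0.40         # quietest fan setting
FAN_MAX_DUTY = 1.00         # full speed
RAMP_START_C = 25.0         # ambient temperature where the ramp begins
RAMP_END_C = 40.0           # ambient temperature where full speed is reached
AMBIENT_WARNING_C = 40.0    # operator warning threshold
INTERNAL_SHUTDOWN_C = 60.0  # internal temperature limit


@dataclass
class Readings:
    ambient_c: float
    internal_c: float
    fans_ok: bool
    rails_ok: bool


def fan_duty(ambient_c):
    """Fan speed as a function of ambient temperature (assumed linear ramp)."""
    if ambient_c <= RAMP_START_C:
        return FAN_MIN_DUTY
    if ambient_c >= RAMP_END_C:
        return FAN_MAX_DUTY
    fraction = (ambient_c - RAMP_START_C) / (RAMP_END_C - RAMP_START_C)
    return FAN_MIN_DUTY + fraction * (FAN_MAX_DUTY - FAN_MIN_DUTY)


def evaluate(r):
    """Return (action, lcd_message) for one monitoring pass."""
    if not r.rails_ok:
        return ("shutdown", "SUPPLY VOLTAGE OUT OF RANGE")
    if not r.fans_ok:
        return ("shutdown", "FAN FAILURE")
    if r.internal_c >= INTERNAL_SHUTDOWN_C:
        return ("shutdown", "OVERTEMPERATURE")
    if r.ambient_c >= AMBIENT_WARNING_C:
        return ("warn", "AMBIENT TEMPERATURE HIGH")
    return ("run", "")


print(fan_duty(30.0))
print(evaluate(Readings(ambient_c=30.0, internal_c=45.0, fans_ok=True, rails_ok=True)))
print(evaluate(Readings(ambient_c=30.0, internal_c=45.0, fans_ok=False, rails_ok=True)))
```

Running the fans slowly at low ambient temperature and ramping only when
needed is one plausible way to obtain the quiet operation the text
describes.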

System Power Supply 

The system power supply is rated for 925W of continuous dc output. Five output
rails are provided: +3.3Vdc (VDL), +4.4Vdc (VDH), +5.1Vdc (VDD), +12Vdc, and
-12Vdc. The +3.3Vdc and +5.1Vdc rails are used for standard logic circuits while
the +4.4Vdc is used exclusively for the CPUs. The +12Vdc is used primarily for disk
drives and I/O with the remaining -12Vdc rail being used strictly for I/O. All rails
have ±1.5% regulation windows. Additionally, a +15Vdc, 300-mA rail is provided
for use by the system power monitor. This rail is electrically isolated from the
computer rails. Its single point of ground is provided by the power monitor, which
eliminates the potential for ground loops. The system power supply implementa-
tion is done entirely in discrete devices with one hybrid, four daughter cards, and
a 2.6-mm-thick, HP FR4 motherboard. The density is 1.8 watts per cubic inch. Both
a discrete version and a dc-to-dc module approach were initially investigated, but
cost, cooling, and reliability concerns ultimately resulted in the discrete version
being chosen.
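
To make the ±1.5% regulation window concrete, the sketch below computes
the allowed band for each of the five rails and flags measurements that
fall outside it. The function names and the sample measurements are
hypothetical; only the nominal voltages and the 1.5% figure come from the
text.

```python
REGULATION = 0.015  # ±1.5% regulation window (from the article)

# Nominal rail voltages (from the article); the dictionary keys are labels
# chosen for this example.
NOMINAL_RAILS_V = {
    "+3.3V": 3.3,
    "+4.4V": 4.4,
    "+5.1V": 5.1,
    "+12V": 12.0,
    "-12V": -12.0,
}


def window(nominal_v, regulation=REGULATION):
    """Return (low, high) limits; sorting handles the negative rail."""
    a = nominal_v * (1 - regulation)
    b = nominal_v * (1 + regulation)
    return (min(a, b), max(a, b))


def out_of_range(measured_v):
    """List the rails whose measured voltage is outside its window."""
    bad = []
    for name, nominal in NOMINAL_RAILS_V.items():
        low, high = window(nominal)
        if not low <= measured_v[name] <= high:
            bad.append(name)
    return bad


for name, nominal in NOMINAL_RAILS_V.items():
    low, high = window(nominal)
    print(f"{name}: {low:.3f} V to {high:.3f} V")

# Hypothetical measurements: the +5.1V rail has drifted high.
sample = {"+3.3V": 3.31, "+4.4V": 4.38, "+5.1V": 5.20, "+12V": 12.1, "-12V": -11.9}
print("out of range:", out_of_range(sample))
```

A ±1.5% window on the +5.1Vdc rail is only about ±77 mV, which helps
explain why remote sensing is used on the output rails.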



electrical characterization efforts mainly focused on test
software and environments that do not match our custom-
ers' operating conditions. While it is likely that all hardware
defects (design related as well as manufacturing related)
will be found with the methods shown above, it is not
known if the new hardware might uncover software defects.
At the same time, it is possible that actual system software
and applications could uncover hardware defects. For this
reason, each system is tested under various load conditions
and system configurations while running actual HP-UX and
MPE/iX application programs and system exercisers. These
efforts result in a system that has been designed to operate
reliably in normal operating conditions as well as under
extremes of environment.



Conclusion 

The J/K-class family of workstations and servers takes a big
step in the direction of converging HP's workstation and
server lines. At the same time, the J/K-class provides leader-
ship performance at exceptional value to computer systems
users.

Acknowledgments

The authors wish to thank Scott Stallard, R&D lab manager
at the beginning of the development of the J/K-class sys-
tems, as well as John Spofford, R&D lab manager during the
final stages of testing and verification, for their support,



[...] 100/120Vac, 15A branch circuit. To [...]
customer installation of a 20A branch circuit [...]
design cycle that the system power supply would use power factor correction. This
means that the input voltage and current waveforms are in phase, so the power
supply appears to be a [...] to the ac line. By comparison, traditional
offline switchers appear to the ac line as [...] very
large "spikes" of input current at the peak [...]. With
the system power supply appearing [...], power is drawn from the ac line
continuously rather than just [...]. Without power factor correction,
typical offline switchers are limited to approximately 500W given 100Vac input
lines. Power factor correction also allows the supply to operate over a wide range
of input voltages without requiring any additional circuitry such as autoranging
circuitry or line select switches. The regulated output voltage of the power factor
correction circuit is +400Vdc. The supply is rated to operate with input voltages
from 90Vac to 140Vac for its lower operating range and from 180Vac to 264Vac in
its higher operating range. The frequency range of operation in either voltage
range is 50 to 60 Hz. There is a minimum guaranteed carryover time of 20 ms
before a powerfail warning is issued and an additional 5 ms of carryover time
after a powerfail is issued. Another benefit of using power factor correction is that
European norms for line distortion are already met when they become mandated
in the European Community.
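
The rated input limits can be expressed as a simple check. The sketch
below, with a hypothetical function name, classifies an ac input into
the lower or higher operating range given in the text, or reports that
it is outside the rated limits.

```python
# Rated limits taken from the text; the function and names are illustrative.
LOW_RANGE_VAC = (90.0, 140.0)
HIGH_RANGE_VAC = (180.0, 264.0)
FREQ_RANGE_HZ = (50.0, 60.0)


def operating_range(volts_ac, freq_hz):
    """Return 'low', 'high', or None if the input is outside rated limits."""
    if not FREQ_RANGE_HZ[0] <= freq_hz <= FREQ_RANGE_HZ[1]:
        return None
    if LOW_RANGE_VAC[0] <= volts_ac <= LOW_RANGE_VAC[1]:
        return "low"
    if HIGH_RANGE_VAC[0] <= volts_ac <= HIGH_RANGE_VAC[1]:
        return "high"
    return None


for volts, hertz in [(100, 50), (120, 60), (160, 50), (230, 50)]:
    print(volts, hertz, operating_range(volts, hertz))
```

Note that, as rated, inputs between 140Vac and 180Vac fall in neither
range.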

Two forward dc-to-dc converters are employed in the power supply. Both convert-
ers take the regulated +400Vdc output of the power factor correction stage and
convert it to the desired regulated output voltage. With the VDD rail exceeding
500W it made sense, considering component selection and cost, to have VDD
generated by one converter and the remaining rails by a second converter, which is
rated at about 425 watts. The use of two converters also allows sequencing of
the VDD rail with respect to the VDH rail, which was a semiconductor requirement.

Two output connectors are required for busing power between the power supply
and the system board. The footprint of the connectors measures only six square
inches, so the impact of the power system on the system board layout was
minimal.

Uninterruptible Power Supplies 

Unlike many previous HP systems which used battery backup of only main memory
during short duration ac power failures, thus halting any processes in progress,
the K-class power supply uses uninterruptible power supplies (UPSs) for backup.
This allows uninterrupted operation during an ac line failure for some predeter-
mined period of time after which the computer can be automatically and control-
lably shut down. Should the power be restored before shutdown is required,
processing will have continued uninterrupted. Should shutdown be required
because of an extended power loss then the computer can do a controlled shut-
down programmatically after which the UPS can be shut down. This controllable
turn-off of the UPS and host computer is well-suited for applications in which



[...] shutting down equipment [...]. Typically this might be over the weekend [...].

[...] from HP. The lower-power units, 600VA, [...], and
[...], [...] connects the
incoming ac directly to the [...] outside of a
defined set of voltage and frequency limits. Once this occurs the UPS then
switches to inverter mode and outputs regulated ac using its internal batteries and
a dc-to-ac converter. This technology is very effective, reliable, [...],
and efficient for many applications in which a defined loss of ac input to the load
can be supported. The time period during which there is no ac input is defined as
transfer time. The offline units have a transfer time of 10 ms maximum and this
maps well into HP's computer products, which have a guaranteed carryover time of
20 ms minimum. The offline topology is very energy efficient: when the ac input is
within tolerance the UPS is just maintaining its internal batteries. Unless the bat-
teries have been run down because of an earlier power failure the batteries are in
a "float" state and require very little input power.

The topology employed by the high-power 3-kVA UPS is online interactive. In this
technology the UPS monitors the incoming ac waveform and adjusts it on a cycle-
by-cycle basis, interactively regulating the output ac to the host computer system.
Should the line deviate substantially outside of its normal range the UPS transfers
from online to inverter mode and continues to provide the load with regulated ac
derived from the UPS's internal batteries. This technology provides excellent regu-
lation of the ac output supplied to the load under all line conditions and is suitable
for mission-critical applications where even slight losses of ac input are disruptive.
The 3-kVA UPS also provides isolation from line for ground-loop-sensitive products
by means of an isolation transformer. This topology is also very energy efficient
because the majority of the losses during normal running are localized in the isola-
tion transformer. With proper choice and design these losses can be greatly re-
duced, resulting in a very efficient design.

HP's offline units are autoranging in both voltage and frequency and have world-
wide safety and regulatory recognition. This feature allows worldwide coverage
with just one model per power range. These units have 15 minutes of run time at
rated load rather than the industry standard of 7 to 8 minutes. The software feature
set includes programmable on and off times, input voltage, input frequency, output
voltage, and battery voltage, UPS internal temperature monitoring, self-test mode,
and numerous other status and warning codes.
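
The way the transfer time, carryover time, and run time fit together can
be sketched as below. The timing figures are those quoted in the text;
the shutdown reserve and the function names are assumptions made for the
example.

```python
TRANSFER_TIME_MAX_MS = 10  # offline UPS transfer time (from the text)
CARRYOVER_MIN_MS = 20      # host power supply carryover (from the text)
UPS_RUNTIME_S = 15 * 60    # run time at rated load (from the text)
SHUTDOWN_RESERVE_S = 120   # ASSUMED time set aside for a controlled shutdown


def survives_transfer(transfer_ms=TRANSFER_TIME_MAX_MS,
                      carryover_ms=CARRYOVER_MIN_MS):
    """The host rides through if the UPS switches before carryover runs out."""
    return transfer_ms < carryover_ms


def outage_outcome(outage_s):
    """Classify what the host experiences for an ac outage of given length."""
    if not survives_transfer():
        return "interrupted"
    if outage_s <= UPS_RUNTIME_S - SHUTDOWN_RESERVE_S:
        return "uninterrupted"
    return "controlled shutdown"


print("margin:", CARRYOVER_MIN_MS - TRANSFER_TIME_MAX_MS, "ms")
for seconds in (30, 600, 3600):
    print(seconds, "s outage ->", outage_outcome(seconds))
```

With the quoted figures there is at least a 10-ms margin between the
worst-case transfer time and the guaranteed carryover time.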

HP's online 3-kVA unit provides regulated 230Vac output at either 50 or 60 Hz. It
provides 3 kVA or 3 kW of output, allowing full utilization with power factor
corrected loads.

Gerald J. Nelson
James K. Koch
Development Engineers
Systems Technology Division



leadership, and direction throughout the project. Addition-
ally, we acknowledge the contribution of numerous engi-
neers, program managers, technicians, and support person-
nel from one end of the country to the other for the many
hours spent in the design and testing of this system. Special
thanks to Chris Christopher for clearing obstacles and urging
us to reach for the best product we could design. Confirming
the quality of their industrial design, the J/K-class systems
won an award at the 1995 iF (Industrie Forum Design Han-
nover), the world's largest product industrial design forum.

References

1. R.A. Pearson, "A Low-Cost, High-Performance PA-RISC Work-
station with Built-in Graphics, Multimedia, and Networking Capabil-
ities," Hewlett-Packard Journal, Vol. 46, no. 2, April 1995, pp. 6-11.

2. G. Kurpanek, et al, "PA 7200: A PA-RISC Processor with
Integrated High-Performance MP Bus Interface," Compcon Digest
of Papers, February 1994, pp. 375-382.

3. D. Hunt, "Advanced Performance Features of the 64-Bit PA 8000,"
Proceedings of Compcon '95, March 1995, pp. 123-128.

4. S. Bainbridge, et al, "Theorem Proving as an Industrial Tool for
System-Level Design," IFIP Transactions A: Computer Science and
Technology: Theorem Provers in Circuit Design, June 1992, pp.
253-274.

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are X/Open Company UNIX 93
branded products.

UNIX® is a registered trademark in the United States and other countries, licensed exclusively
through X/Open Company Limited.

X/Open® is a registered trademark and the X device is a trademark of X/Open Company
Limited in the UK and other countries.






A High-Performance, Low-Cost 
Multiprocessor Bus for Workstations 
and Midrange Servers 

The Runway bus, a synchronous, 64-bit, split-transaction, time
multiplexed address and data bus, is a new processor-memory-I/O
interconnect optimized for one-way to four-way symmetric
multiprocessing systems. It is capable of sustained memory bandwidths of
up to 768 megabytes per second in a four-way system.

by William R. Bryg, Kenneth K. Chan, and Nicholas S, Fiduecia 



The HP 9000 K-class servers and J-class workstations are the
first systems to introduce a low-cost, high-performance bus
structure named the Runway bus. The Runway bus is a new
processor-memory-I/O interconnect that is ideally suited for
one-way to four-way symmetric multiprocessing for high-end
workstations and midrange servers. It is a synchronous,
64-bit, split-transaction, time multiplexed address and data
bus. The HP PA 7200 processor and the Runway bus have
been designed for a bus frequency of 120 MHz in a four-way
multiprocessor system, enabling sustained memory bandwidths
of up to 768 Mbytes per second without external
interface or "glue" logic.

The goals for the design of the Runway protocol were to
provide a price/performance-competitive bus for one-way to
four-way multiprocessing, to minimize interface complexity,
and to support the PA 7200 and future processors. The Runway
bus achieves these goals by maximizing bus frequency,
pipelining multiple operations as much as possible, and
using available bandwidth very efficiently, while keeping
complexity and pin count low enough so that the bus interface
can be integrated directly on the processors, memory
controllers, and I/O adapters that connect to the bus.

Overview 

The Runway bus features multiple outstanding split transactions
from each bus module, predictive flow control, an efficient
distributed pipelined arbitration scheme, and a snoopy
coherency protocol, which allows flexible coherency check
response time.

The design center application for the Runway protocol is the
HP 9000 K-class midrange server. Fig. 1 shows a Runway
bus block diagram of the HP 9000 K-class server. The Runway
bus connects one to four PA 7200 processors with a
dual I/O adapter and a memory controller through a shared
address and data bus. The dual I/O adapter is logically two
separate Runway modules packaged on a single chip. Each
I/O adapter interfaces to the HP HSC I/O bus. The memory
controller acts as Runway host, taking a central role in
arbitration, flow control, and coherency through use of a
special client-OP bus.



The shared bus portion of the Runway bus includes a 64-bit
address and data bus, master IDs and transaction IDs to tag
all transactions uniquely, address valid and data valid signals
to specify the cycle type, and parity protection for data and
control. The memory controller specifies what types of
transactions can be started by driving the special client-OP
bus, which is used for flow control and memory arbitration.
Distributed arbitration is implemented with unidirectional
wires from each module to other modules. Coherency is
maintained by having all modules report coherency on dedicated
unidirectional wires to the memory controller, which
calculates the coherency response and sends it with the
data.

Each transaction has a single-cycle header of 64 bits, which
minimally contains the transaction type (TTYPE) and the
physical address. Each transaction is identified or tagged
with the issuing module's master ID and a transaction ID,
the combination of which is unique for the duration of the
transaction. The master ID and transaction ID are transmitted
in parallel to the main address and data bus, so no
extra cycles are necessary for the transmission of the master
ID and transaction ID.

The Runway bus is a split-transaction bus. A read transaction
is initiated by transmitting the encoded header, which
includes the address, along with the issuer's master ID and a
unique transaction ID, to all other modules. The issuing
module then relinquishes control of the bus, allowing other
modules to issue their transactions. When the data is available,
the module supplying the data, typically memory, arbitrates
for the bus, then transmits the data along with the
master ID and transaction ID so that the original issuer
can match the data with the particular request.

Write transactions are not split, since the issuer has the data
that it wants to send. The single-cycle transaction header is
followed immediately by the data being written, using the
issuer's master ID and a unique transaction ID.
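
The tagging scheme described above can be illustrated with a small
sketch. The Python below is a hypothetical model, not HP's
implementation: the class and method names (Requester, issue_read,
on_data_return) and the limit of four outstanding reads are
assumptions made for illustration only. It shows how an issuer could
match an out-of-order data return to its original request using
nothing but the (master ID, transaction ID) pair.

```python
class Requester:
    """Toy model of one bus module that issues split read transactions."""

    def __init__(self, master_id, max_outstanding=4):
        self.master_id = master_id
        self.max_outstanding = max_outstanding  # illustrative limit, not from the article
        self.pending = {}                       # transaction ID -> requested address

    def issue_read(self, address):
        """Pick a free transaction ID and return the tag sent with the header."""
        for txn_id in range(self.max_outstanding):
            if txn_id not in self.pending:
                self.pending[txn_id] = address
                return (self.master_id, txn_id)
        raise RuntimeError("no free transaction ID")

    def on_data_return(self, master_id, txn_id, data):
        """Match a data return to its request; ignore returns meant for other modules."""
        if master_id != self.master_id:
            return None
        address = self.pending.pop(txn_id)      # the ID becomes free for reuse
        return (address, data)
```

Because the tag travels with the data, returns may arrive in any order
and no second address cycle is needed to identify them.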

Fig. 2 shows a processor issuing a read transaction followed
immediately by a write transaction. Each transaction is
tagged with the issuing module's master ID as well as a
transaction ID. This combination allows the data response,
tagged with the same information, to be directed back to the
issuing module without the need for an additional address
cycle. Runway protocol allows each module to have up to 64
transactions in progress at one time.

Fig. 1. HP 9000 K-class server Runway bus block diagram.
(COH = Coherency Signal Lines, MMC = Master Memory Controller,
IOA = I/O Adapter.)

Arbitration 

To minimize arbitration latency without decreasing maximum
bus frequency, the Runway bus has a pipelined, two-state
arbitration scheme in which the determination of the
arbitration winner is distributed among all modules on the
bus. Each module drives a unique arbitration request signal
and receives other modules' arbitration signals. On the first
arbitration cycle, all interested parties assert their
arbitration signals, and the memory controller drives the client-OP
control signals (see Table I) indicating flow control
information or whether all modules are going to be preempted by a
memory data return. During the second cycle, all modules
evaluate the information received and make the unanimous
decision about who has gained ownership of the bus. On the
third Runway cycle, the module that won arbitration drives
the bus.

Fig. 2. A processor issuing a read transaction followed
immediately by a write transaction on the Runway bus. (Rows shown:
Arbitration, Address/Data, Master ID, Transaction ID.)

Table I
Client-OP Bus Signals

ANY_TRANS       Any transaction allowed
NO_IO           Any transaction allowed except CPU to I/O
RETURNS_ONLY    Return or response transactions allowed
ONE_CYCLE       Only one-cycle transactions allowed
NONE_ALLOWED    No transactions allowed
MEM_CONTROL     Memory module controls bus
SHARED_RETURN   Shared data return
ATOMIC          Atomic owner can issue any transaction; other
                modules can only issue response transactions.

With distributed arbitration instead of centralized arbitration,
arbitration information only needs to flow once between
bus requesters. Using a centralized arbitration unit
would require information to flow twice, first between the
requester and the arbiter and then between the arbiter and
the winner, adding extra latency to the arbitration.

Distributed arbitration on the Runway bus allows latency
between arbitration and bus access to be as short as two
cycles. Once a module wins arbitration, it may optionally
assert a special long transaction signal to extend bus ownership
for a limited number of cycles for certain transactions.
To maximize bus utilization, arbitration is pipelined: while
arbitration can be asserted on any cycle, it is effective for
the selection of the next bus owner two cycles before the
current bus owner releases the bus.

Arbitration priority is designed to maintain fairness while
delivering optimal performance. The highest arbitration
priority is always given to the current bus owner through
use of the long transaction signal, so that the current owner
can finish whatever transaction it started. The second highest
priority is given to the memory controller for sending out
data returns, using the client-OP bus to take control of the
Runway bus. Since the data return is the completion of a
previous split read request, it is likely that the requester is
stalled waiting for the data, and the data return will allow
the requester to continue processing. The third highest
arbitration priority goes to the I/O adapter, which requests the
bus relatively infrequently, but needs low latency when it
does. Lowest arbitration priority is the processors, which
use a round-robin algorithm to take turns using the bus.
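
The priority order above can be summarized in a few lines of code.
The following Python is only a sketch of the ordering the text
describes; the function name pick_winner, its arguments, and the
module labels "MEM", "IOA", and "CPUn" are hypothetical, and the
real bus makes this decision in distributed hardware rather than in
a single function.

```python
def pick_winner(owner_extends, memory_return, io_request, cpu_requests, last_cpu):
    """Return the label of the module that wins the next bus cycle, or None.

    owner_extends: label of the current owner if it asserts the long
                   transaction signal, else None
    memory_return: True if the memory controller has a data return to send
    io_request:    True if the I/O adapter is requesting the bus
    cpu_requests:  list of booleans, one per processor
    last_cpu:      index of the processor that most recently owned the bus
    """
    if owner_extends is not None:      # 1. current owner finishes its transaction
        return owner_extends
    if memory_return:                  # 2. memory data returns
        return "MEM"
    if io_request:                     # 3. I/O adapter
        return "IOA"
    n = len(cpu_requests)              # 4. processors, round-robin after last_cpu
    for offset in range(1, n + 1):
        cpu = (last_cpu + offset) % n
        if cpu_requests[cpu]:
            return f"CPU{cpu}"
    return None
```

For example, with processors 0, 1, and 3 requesting and processor 1
the most recent owner, the round-robin search starts at processor 2
and selects processor 3.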

The arbitration protocol is implemented in such a way that
higher-priority modules do not have to look at the arbitration
request signals of lower-priority modules, thus saving
pins and reducing costs. A side effect is that low-priority
modules can arbitrate for the bus faster than high-priority
modules when the bus is idle. This helps processors, which
are the main consumers of the bus, and doesn't bother the
memory controller since it can predict when it will need the
bus for a data return and can start arbitrating sufficiently
early to account for the longer delay in arbitration.

Predictive Flow Control 

To make the best use of the available bandwidth and greatly
reduce complexity, transactions on the Runway bus are
never aborted or retried. Instead, the client-OP bus is used
to communicate what transactions can safely be initiated, as
shown in Table I.

Since the Runway bus is heavily pipelined, there are queues
in the processors, memory controllers, and I/O adapters to
hold transactions until they can be processed. The client-OP
bus is used to communicate whether there is sufficient room
in these queues to receive a particular kind of transaction.
Through various means, the memory controller keeps track
of how much room is remaining in these queues and restricts
new transactions when a particular queue is critically full,
meaning that the queue would overflow if all transactions
being started in the pipeline plus one more all needed to go
into that queue. Since the memory controller "predicts"
when a queue needs to stop accepting new transactions to
avoid overflow, this is called predictive flow control.
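
The "critically full" condition can be written as a one-line test.
The sketch below is an interpretation of the definition in the
preceding paragraph, not the memory controller's actual logic; the
function name and the parameters capacity, occupied, and in_pipeline
are assumptions introduced here.

```python
def critically_full(capacity, occupied, in_pipeline):
    """True if the queue would overflow should every transaction already
    started in the pipeline, plus one more, land in this queue."""
    return occupied + in_pipeline + 1 > capacity


# Hypothetical 8-entry queue with 2 transactions already in the pipeline:
# at 5 occupied entries one more transaction still fits (5 + 2 + 1 = 8),
# at 6 occupied entries the controller must restrict new transactions.
print(critically_full(capacity=8, occupied=5, in_pipeline=2))
print(critically_full(capacity=8, occupied=6, in_pipeline=2))
```

The entries held in reserve by this test are the "almost never used"
queue space that the article counts as the cost of the scheme.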

Predictive flow control increases the cost of queue space by
having some queue entries that are almost never used, but
the effective cost of queue space is going down with greater
integration. The primary benefit of predictive flow control is
greatly reduced complexity, since modules no longer have to
design in the capability of retrying a transaction that got
aborted. This also improves bandwidth since each transaction
is issued on the bus exactly once.



A secondary benefit of predictive flow control is faster
completion of transactions that must be issued and received in
order, particularly writes to I/O devices. If a transaction is
allowed to be aborted, a second serially dependent transaction
cannot be issued until the first transaction is guaranteed
not to be aborted. Normally, this is after the receiving
module has had enough time to look at the transaction and
check the state of its queues for room, which is at least
several cycles into the transaction. With predictive flow control,
the issuing module knows when it wins arbitration that the
first transaction will issue successfully, and the module can
immediately start arbitrating for the second transaction.

Coherency 

The Runway bus provides cache and TLB (translation lookaside
buffer) coherence with a snoopy protocol. The protocol
maintains cache coherency among processors and I/O
modules with a minimum amount of bus traffic while also
minimizing the processor complexity required to support
snoopy multiprocessing, sometimes at the expense of memory
controller complexity.

The Runway bus supports processors with four-state
caches: a line may be invalid, shared, private-clean, or
private-dirty. An invalid line is one that is not present in cache.
A line is shared if it is present in two or more caches. A
private line can only be present in one cache; it is private-dirty
if it has been modified, private-clean otherwise.

Whenever a coherent transaction is issued on the bus, each
processor or I/O device (acting as a third party) performs a
snoop, or coherency check, using the virtual index and physical
address. Each module then sends its coherency check
status directly to the memory controller on dedicated COH
signal lines. Coherency status may be COH_OK, which means
that either the line is absent or the line has been invalidated.
A coherency status of COH_SHR means that the line is either
already shared or has changed to shared after the coherency
check. A third possibility is COH_CPY, which means the third
party has a modified copy of the line and will send the line
directly to the requester. Fig. 3 shows a coherent read transaction
that hits a dirty line in a third party's cache.
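
A third party's snoop behavior can be captured as a lookup table. The
Python below is a sketch based on one reading of the coherency
statuses described above and of the state transitions summarized in
Fig. 4 of the article; the dictionary layout, the function name
snoop, and the upper-case transaction labels are assumptions, not
part of the Runway specification.

```python
# (new state of the snooping cache's line, status reported on the COH lines),
# keyed by the observed transaction and the line's current state.
SNOOP_TABLE = {
    "READ_SHARED_OR_PRIVATE": {          # issued on a load miss
        "invalid":       ("invalid", "COH_OK"),
        "shared":        ("shared",  "COH_SHR"),
        "private-clean": ("shared",  "COH_SHR"),
        "private-dirty": ("invalid", "COH_CPY"),   # followed by a cache-to-cache write
    },
    "READ_PRIVATE": {                    # issued on a store miss
        "invalid":       ("invalid", "COH_OK"),
        "shared":        ("invalid", "COH_OK"),
        "private-clean": ("invalid", "COH_OK"),
        "private-dirty": ("invalid", "COH_CPY"),   # followed by a cache-to-cache write
    },
}


def snoop(transaction, line_state):
    """Return (new_line_state, coherency_status) for a third-party cache."""
    return SNOOP_TABLE[transaction][line_state]
```

Note that only a private-dirty line produces COH_CPY, since that is
the only case in which memory does not hold the current data.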

After the memory controller has received coherency status
from every module, it will return memory data to the
requester if the coherency status reports consist of only
COH_OK or COH_SHR. If any module signaled COH_SHR, the
memory controller will inform the requester to mark the line
shared on the client-OP lines during the data return. If any
module signals COH_CPY, however, the memory controller
will discard the memory data and wait for the third party to
send the modified cache line directly to the requester in a
C2C_WRITE transaction. The memory controller will also write
the modified data in memory so that the requester can mark
the line clean instead of dirty, freeing the requester (and the
bus) from a subsequent write transaction if the line has to
be cast out. Fig. 4 shows cache state transitions resulting
from CPU instructions. Fig. 5 shows transitions resulting
from bus snoops.

Fig. 3. A coherent read transaction hits a dirty line in a third
party's cache. (CCC = Cache Coherency Check, C2C = Cache-to-Cache
Transfer.)

Fig. 4. Cache state transitions resulting from CPU instructions.
The figure shows the line state at other CPUs that caused the
transition, as well as the effect on the other CPU's state. For
example, from the invalid state, a load miss will always cause a
Read_Shared_or_Private transaction. The final state for the load miss
will be either private-clean (if the other CPU had the cache line
invalid or private-dirty) or shared (if another CPU had the cache
line shared or private-clean).

Load miss -> Read_Shared_or_Private transaction

  Line State at Other CPU   Other CPU Action   Bus Signal
  Invalid                   Stay Invalid       COH_OK
  Private-Dirty             Invalidate         COH_CPY, C2C Write
  Shared                    Stay Shared        COH_SHR
  Private-Clean             Make Shared        COH_SHR

Store miss -> Read_Private transaction

  Line State at Other CPU   Other CPU Action   Bus Signal
  Shared                    Invalidate         COH_OK
  Private-Clean             Invalidate         COH_OK
  Invalid                   Stay Invalid       COH_OK
  Private-Dirty             Invalidate         COH_CPY, C2C Write
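
The memory controller's decision can be sketched as a small function
over the collected status reports. This is an illustrative model
only: the function name coherency_response and the shape of its
return value are assumptions made here, while the three status names
come from the article.

```python
def coherency_response(statuses):
    """Combine the coherency statuses reported by all snooping modules.

    Returns a dict with:
      source      - "memory" if memory data is returned to the requester,
                    "cache-to-cache" if a third party supplies the line
      mark_shared - True if the requester is told to mark the line shared
    """
    statuses = list(statuses)
    unknown = set(statuses) - {"COH_OK", "COH_SHR", "COH_CPY"}
    if unknown:
        raise ValueError(f"unknown coherency status: {sorted(unknown)}")
    if "COH_CPY" in statuses:
        # Memory data is discarded; the owner sends the modified line in a
        # C2C_WRITE, and memory is updated so the requester's copy is clean.
        return {"source": "cache-to-cache", "mark_shared": False}
    return {"source": "memory", "mark_shared": "COH_SHR" in statuses}
```

A single COH_CPY report overrides everything else, because in that
case the copy in memory is stale.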

The Runway coherency protocol supports multiple outstanding
coherency checks and allows each module to signal
coherency status at its own rate rather than at a fixed
latency. Each module maintains a queue of coherent transactions
received from the bus to be processed in FIFO order
at a time convenient for the module. As long as the coherency
response is signaled before data is available from the
memory controller, delaying the coherency check will not
increase memory latency. This flexibility allows CPUs to
implement simple algorithms to schedule their coherency
checks so as to minimize conflicts with the instruction pipeline
for cache access.




Virtual Cache Support

Like all previous HP multiprocessor buses, virtually indexed
caches are supported by having all coherent transactions
also transmit the virtual address bits that are used to index
the processors' caches. The twelve least-significant address
bits are the offset within a virtual page and are never
changed when translating from a virtual to a physical
address. Ten virtual cache index bits are transmitted; these are
added to the twelve page-offset bits so that virtual caches up
to 4M bytes deep (22 address bits) can be supported.
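
The arithmetic behind the 4M-byte figure is easy to check. In the
sketch below, the constant names and the helper cache_index are
hypothetical; only the bit counts (twelve page-offset bits and ten
virtual index bits) are taken from the text.

```python
PAGE_OFFSET_BITS = 12      # offset within a virtual page, unchanged by translation
VIRTUAL_INDEX_BITS = 10    # virtual index bits sent with each coherent transaction

INDEX_BITS = PAGE_OFFSET_BITS + VIRTUAL_INDEX_BITS   # 22 address bits
MAX_CACHE_BYTES = 1 << INDEX_BITS                    # 4M bytes


def cache_index(physical_address, virtual_index):
    """Form a 22-bit cache index from the page offset of the physical
    address and the transmitted virtual index bits."""
    offset = physical_address & ((1 << PAGE_OFFSET_BITS) - 1)
    vindex = virtual_index & ((1 << VIRTUAL_INDEX_BITS) - 1)
    return (vindex << PAGE_OFFSET_BITS) | offset
```

With this layout a snooping processor can index its cache without
performing a reverse translation from physical to virtual address.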

Coherent I/O Support 

Runway I/O adapters take part in cache coherency, which
allows more efficient DMA transfers. Unlike previous systems,
no cache flush loop is needed before a DMA output
and no cache purge is needed before DMA input can begin.
The Runway bus protocol defines DMA write transactions
that both update memory with new data lines and cause
other modules to invalidate data that may still reside in their
caches.

The Runway bus supports coherent I/O in a system with
virtually indexed caches. I/O adapters have small caches
and both generate and respond to coherent transactions.
The I/O adapters have a lookup table (I/O TLB) to attach
virtual index information to I/O reads and writes, for both
DMA accesses and control accesses. For more information
see the article on page 52.

Coherent I/O also reduces the overhead associated with the
load-and-clear semaphore operation. Since all noninstruction
accesses in the system are coherent, semaphore operations
are performed in the processors' and I/O adapters'
caches. The processor or I/O adapter gains exclusive ownership
and atomically performs the semaphore operation in its
own cache. If the line is already private in the requester's
cache, no bus transaction is generated to perform the operation,
greatly improving performance. The memory controller
is also simplified because it does not need to support semaphore
operations in memory.



Fig. 5. Cache state transitions resulting from bus coherency
checks (snoops).






Runway Bus Electrical Design Considerations 



The Runway bus's high bandwidth is a result of the strategies adopted for its
electrical design. These included an efficient data transfer scheme, a simple clock
system with low clock skew, a compact bus topology, and a termination strategy
that eliminates dead cycles when changing bus masters.

Data Transfer Scheme

The simple data transfer strategy, shown in Fig. 1, allows most of the cycle to be
used to transfer data. An edge-triggered Runway pad driver is enabled by the
rising edge of the on-chip Runway clock, RCK, causing the data to be driven onto
the external bus. This driven data is then latched one cycle later at the receiving
devices by the next rising edge of the receiver's on-chip Runway clock. On each
Runway VLSI chip, the same physical clock edge is used to trigger the signal driver
and latch the data from the previous cycle in the signal receiver.

The following two equations express timing constraints that must be met for
proper operation:

Setup time equation: $\mathrm{DRIVE}_{max} + \mathrm{SKEW} + \mathrm{SU} < T_{period}$

Hold time equation: $\mathrm{SKEW} + \mathrm{HOLD} < \mathrm{DRIVE}_{min}$

where DRIVE is the delay from the rising edge of RCK at the driver to the time
when the data is valid at the receiver, SU is the receiver setup time, HOLD is the
receiver hold time, SKEW is the maximum skew of the clock signal (RCK) between
the driver of one chip and the receiver of another, and $T_{period}$ is the clock period.
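
The two constraints can be checked numerically. The Python below is
a sketch only: the function name timing_ok is an assumption, and the
delay values in the example are invented for illustration, not
measured Runway figures. Only the 120-MHz clock rate comes from the
article.

```python
def timing_ok(drive_max, drive_min, skew, setup, hold, t_period):
    """Evaluate the setup and hold constraints (all times in ns).

    setup: DRIVE_max + SKEW + SU   < T_period
    hold:  SKEW + HOLD             < DRIVE_min
    Returns (setup_ok, hold_ok).
    """
    setup_ok = drive_max + skew + setup < t_period
    hold_ok = skew + hold < drive_min
    return setup_ok, hold_ok


T_PERIOD_NS = 1000 / 120   # one cycle at 120 MHz, about 8.33 ns

# Hypothetical delays: both constraints are met.
print(timing_ok(drive_max=6.0, drive_min=1.5, skew=0.5,
                setup=1.0, hold=0.5, t_period=T_PERIOD_NS))
```

The example makes the role of skew visible: it appears on the
left-hand side of both inequalities, so every reduction in skew
relaxes the setup and hold requirements at the same time.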

Clock Path 

The Runway clock orchestrates the transfer of information among the components
on the bus. The path of the Runway clock to the driver and receiver circuits can be
divided into three components: on-board clock generation and distribution to the
VLSI chip inputs, on-chip clock reception and buffering, and on-chip clock
distribution to the Runway driver and receiver circuits.

Skew can be introduced by any of these components. Inspection of the setup and
hold time equations reveals that it is desirable to reduce skew to as small a value
as possible.

The clock path begins at the custom VLSI clock generation chip. This chip
generates several differential pairs of clock outputs, one per Runway VLSI chip. By using
one chip as the source of the clock signals in this system, the output-to-output
skew was kept very small.

Fig. 1. The data transfer timing strategy for the Runway bus allows most of the cycle
to be used to transfer data.



Each dedicated clock pair is carefully routed on the printed circuit board to its
Runway VLSI chip. The traces are adjusted in length so that the arrival time of
each clock at the pins of the Runway VLSI chips can be accurately placed with
respect to the others. Because of known timing differences in the paths of the
clock from input pin to driver and receiver circuits for each type of Runway VLSI
chip, it is useful to be able to tune the clocks in this manner. More will be said
about this later.

Each differential clock signal is received at each chip by a receiver/buffer circuit
that transforms the signal into a single-ended signal RCK with normal CMOS
voltage levels. This RCK signal then fans out to all the Runway signal driver and
receiver circuits located at the pads and the associated interface circuitry located
in the core of the chip. Since the interface circuitry is similar on all three types of
Runway VLSI chips, the capacitive loading on RCK is nearly identical for all three
types, which ensures that the delay through the clock buffer is similar for all
Runway VLSI chips.

The RCK signal is routed using various techniques to reduce the distribution delay
and thus the variation in delay. The clock receiver/buffer bit slice is centrally
placed in the interface so that the total distance the RCK signal must travel
on-chip to the farthest signal bit slice is minimized. This clock is routed in wide metal
so that the delay along this line is low. The signal pad ordering in the Runway
interfaces for all of the Runway VLSI chips is nearly identical. This ensures that
the distance from the clock buffer to a signal pad is the same for all of the Runway
VLSI chips.

The goal in the design of the Runway clock system was to have the on-chip clock
RCK arrive at the corresponding signal driver or receiver at the same time at each
Runway device. Since the CPU is fabricated in CMOS14, a faster technology than
the CMOS26 process used for the I/O adapter and memory controller chips, the
on-board clock signal to the CPU is delayed to account for this known timing
difference. Thus the clock skew is only a function of the CMOS26 parameters, which
keeps the skew to a minimum.

Overall, the total chip-to-chip clock skew on RCK at the signal driver and receiver
circuits is under Tl ns worst-case.

Bus Topology 

The components on the bus are designed to be close together to limit the
capacitive and inductive load on each Runway signal line. The setup and hold time
equations can be used to determine how best to lay out the signal path. The
requirements of the two equations sometimes conflict. For example, the setup time
equation wants us to minimize DRIVE_max and the hold time equation wants us to
maximize DRIVE_min. In plain English, we want an interconnect scheme that
minimizes the overall trace length while maintaining the greatest separation between
components.

An ideal connection topology would be a star with the devices placed at the tips
of the star as shown in Fig. 2a. Because of manufacturing difficulties with this
topology, the modified star shape shown in Fig. 2b, which fits comfortably using
standard printed circuit board technology, was chosen. As the figure suggests, the
main trunk of the Runway bus consists of a standard printed circuit trace running
along a backplane with at most four daughter cards attached to the backplane,
two per side. Each daughter card will hold one CPU. The memory controller and
the I/O adapter reside on the backplane along with the clock generation circuitry.
This connection scheme interconnects six Runway devices with less than 3 inches
of total printed circuit trace for the longest signal with no two devices farther
apart than 4.5 inches.



The Runway bus has both full-line (32-byte) and half-line
(16-byte) DMA input transactions, called WRITE_PURGE and
WRITE16_PURGE. Both transactions write the specified amount
of data into memory at the specified address, then invalidate
any copies of the entire line that may be in a processor's
cache. The full-line WRITE_PURGE is the accepted method for
DMA input on systems that have coherent I/O, if the full line
is being written. The half-line WRITE16_PURGE is used for
16-byte writes if enabled via the fast DMA attribute in the
I/O TLB. Software programs the I/O TLB with the fast attribute
if it knows that both halves of the line will be overwritten
by the DMA. Otherwise, if the I/O TLB does not specify






Termination Strategy

- la'i i^ ■-' . ->^- - -^ _.____. ______ =? usuaWy i^©I for high- 

iL—ji :ei:. ' : luv^d. lliai rs, the re- 

i?rofegatino 

rieeO> lo tJf)vc in ont; tlifeciion Jhe ifiirtiiri- 
wav ift/t^n ttie driver tristafei 0^ lyms oft 'A 

flows constantly. When the driver turns off, the bus is disturbed by the change of
current through the inductive traces and bond wires. This disturbance sends a
wave propagating down the transmission line in the direction opposite to the
direction of propagation when the driver turns on. A special frequency-limiting
case is the master changeover, when a driver at the end of the bus starts to drive
the same value that was driven by the master at the other end of the bus in the
previous cycle. In this case, constructive interference of the two propagating
waves may cause the bus to take a long time to settle. It is not uncommon to
insert a dead cycle in the protocol to allow extra time for the bus to settle when
the bus changes masters.

On a series-terminated bus, the bus driver has the ability to drive in both
directions. The on impedance of the driver transistor acts as the termination resistor.
The driver transistor will turn on and drive the bus to the desired level. Near the
end of the cycle when the bus is nearing its final value, the drivers will be
sourcing or sinking only a small fraction of their peak currents at the start of the cycle.



Fig. 2. (a) Star topology. (b) Modified star topology of the Runway bus.



Because of th^s. wt>gn tt^ driver is disabled at the end of the cycle, there rs v^ 
I ittfe distiirtjance i n i'^ - . . ;| piKSible to have anott^ " ^*ef 

the b^is in the very r^ oecause the ori-inipedar ?r is 

jaify wait tD receive the ^e. 

-ay 

Runwa . , <5 vefv cficnoact This me^s tt^t the Ume difference be- 

!,'. it5 of the V s IS relalEvefy sniall compart 

T - Dme Had . nation on the Ruf^v.^,' b*?? 

viE rrngin have been able to itB^tease u>e h^^ency of the bus hy abou" 
Since the dead cycle would have cost us about 20 to 30 percent in bandwidth, we
decided to use series termination instead.

Other advantages of series-terminated buses include good tolerance to impedance
mismatches and long stubs and no dc power dissipation. Also, we saved valuable
board space by not having to place resistors on each Runway bus signal.

Simulated and Characterized Performance

Early in the Runway bus simulations, it became clear that the result would be
dependent not only on which driver drove the bus, but also on the current state of
the bus. The state of the bus is precisely determined by the history of drivers
driving the bus along with the starting condition of the bus. Since the current state
of the bus is mostly determined by who was last driving it, all possible pairs of
successive bus transactions by two drivers were simulated. The symmetry of the
bus and our ability to predict and eliminate combinations that would not be
worst-case helped cut down the number of slow-case simulations to 32. A network of
fast HP 9000 Model 720 and 750 workstations was able to run these simulations,
each of which normally takes one machine about one hour to run, in about
4 hours.

The SPICE model to simulate the worst case was in constant revision as more and
more details from the design were implemented. The final model had
artwork-extracted transistor models for the signal driver and receiver for each Runway
VLSI chip and a detailed schematic model for each package trace and board
connector. The printed circuit traces were modeled using SPICE transmission-line
primitives.

The final simulated worst-case bus frequency came in at 152 MHz using a fully
loaded Runway bus. The characterized frequency of the bus over the extremes of
process, temperature, and voltage showed operation of at least 140 MHz. The
maximum characterization frequency was limited to 140 MHz because of the
limitations of other system components. These results gave us the confidence to
conclude that the Runway bus will work at the 120-MHz frequency goal with
sufficient manufacturing margin.

Acknowledgments 

I would especially like to thank Fred Eatock for his countless and thorough Runway
bus simulations, Dave "Spreadsheet" Malicoat for his careful design of the clock
system which kept skew low, Ken Pomaranski and Tony Chan for their very
compact board layouts, Denny Renfrow for suggesting and designing the full state
predriver, John Youden for his incredible understanding of transmission line
phenomena, Craig Gleason for his invaluable design suggestions, and all the other
great engineers who had a hand in this design.

Nicholas S. Fiduccia
Member of the Technical Staff
Systems Technology Division



the fast attribute, the I/O adapter uses the slower read private,
merge, write back transaction, which will safely merge
the DMA data with any dirty processor data. The use of
WRITE16_PURGE greatly increases DMA input bandwidth for
older, legacy I/O cards that use 16-byte blocks.
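
The choice among the DMA input transactions can be written as a
short decision function. The Python below is a sketch of the rules
stated in the text; the function name, its parameters, and the label
"READ_PRIVATE_MERGE_WRITE_BACK" for the slower sequence are
hypothetical. Treating every write size other than 32 or 16 bytes as
the slow path is also an assumption, since the article does not
discuss other sizes.

```python
LINE_BYTES = 32        # full cache line
HALF_LINE_BYTES = 16   # half line


def dma_input_transaction(write_bytes, fast_attribute):
    """Choose the bus transaction an I/O adapter uses for a DMA input write.

    fast_attribute: the fast DMA attribute from the I/O TLB entry, set by
    software when both halves of the line will be overwritten.
    """
    if write_bytes == LINE_BYTES:
        return "WRITE_PURGE"
    if write_bytes == HALF_LINE_BYTES and fast_attribute:
        return "WRITE16_PURGE"
    # Slow path: read the line private, merge the DMA data, write it back.
    return "READ_PRIVATE_MERGE_WRITE_BACK"
```

The fast attribute matters because WRITE16_PURGE invalidates the
whole line in other caches, which is only safe when the other half
of the line is about to be overwritten as well.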



Design Trade-offs

To get the best performance from a low-cost interconnect,
the bus designers chose a time multiplexed bus, so that the
same pins and wires can be used for both address and data.
Separate address and data buses would have increased the
number of pins needed by about 50%, but would have
increased usable bandwidth by only 20%. Since the number of
pins on a chip has a strong impact on chip cost, the time
multiplexed address and data bus gives us the best trade-off.
A smaller number of pins is also important to allow the bus
interface to be included on the processor chip instead of
requiring one or more separate transceiver chips.

To get the best bandwidth, the designers targeted the
highest bus frequency that could be achieved without
requiring dead cycles for bus turnaround. The use of dead cycles
would have allowed a higher nominal frequency, but dead
cycles would have consumed 20 to 30 percent of the
bandwidth, for a net loss.

Bandwidth Efficiency 

The Runway bus has a rated raw bandwidth of 960 Mbytes/
second, which is derived by taking the width of the bus and
multiplying by the frequency: 64 bits × 120 MHz ÷ 8 bits/
byte = 960 Mbytes/second. However, raw bandwidth is
an almost meaningless measure of the bus, since different
buses use the raw bandwidth with greatly differing amounts
of efficiency. Instead, buses should be compared on effective
or usable bandwidth, which is the amount of data that is
transferred over time using normal transactions.

To deliver as much of the raw bandwidth as possible as
usable bandwidth, the designers minimized the percentage of
the cycles that were not delivering useful data. Transaction
headers, used to initiate a transaction, are designed to fit
within a single cycle. Data returns are tagged with a separate
transaction ID field, so that data returns do not need a
return header. Finally, electrically, the bus is designed so
that dead cycles are not necessary for master changeover.
The only inherent overhead is the one-cycle transaction
header.

For 32-byte lines, both read and write transactions take
exactly five cycles on the bus: one cycle for the header and
four cycles for data. It doesn't matter that the read
transaction is split. Thus, for the vast majority of the transactions
issued on the bus, 80% of the cycles are used to transmit
data. The effective bandwidth is 960 Mbytes/s × 80% =
768 Mbytes/s, which is very efficient compared to competitive
buses.
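
The arithmetic above can be checked with a short sketch. The Python
below is illustrative only: the function names and the split into
header cycles and data cycles are assumptions made for this example and
are not part of the Runway specification.

```python
def raw_bandwidth_mbytes(width_bits: int, freq_mhz: float) -> float:
    """Raw bandwidth in Mbytes/s: bus width times frequency, in bytes."""
    return width_bits * freq_mhz / 8


def effective_bandwidth_mbytes(raw: float, header_cycles: int,
                               data_cycles: int) -> float:
    """Usable bandwidth: raw bandwidth scaled by the fraction of
    cycles in a transaction that carry data."""
    return raw * data_cycles / (header_cycles + data_cycles)


# 64-bit bus at 120 MHz; a 32-byte line takes 1 header + 4 data cycles.
raw = raw_bandwidth_mbytes(64, 120)
usable = effective_bandwidth_mbytes(raw, header_cycles=1, data_cycles=4)
print(raw, usable)
```

With the figures from the text (one header cycle, four data cycles),
the sketch gives 960 and 768 Mbytes/s.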

In addition, the Runway bus is able to deliver its bandwidth
to the processors that need the bandwidth. Traditional buses
typically allow each processor to have only a single
outstanding transaction at a time, so that each processor can
only get at most about a quarter of the available bandwidth.
Runway protocol allows each module (processor or I/O
adapter) to have up to 64 transactions outstanding at a
time. The PA 7200 processor uses this feature to have
multiple outstanding instruction and data prefetches, so that it
has fewer stalls as a result of cache misses. When a
processor really needs the bandwidth, it can actually get the vast
majority of the available 768-Mbyte/s bandwidth.

High Frequency 

The Runway protocol is designed to allow the highest
possible bus frequency for a given implementation. The
protocol is designed so that no logic has to be performed in the
same cycle that data is transmitted from one chip to another
chip. Any logic put into the transmission cycle would add to
the propagation delay and reduce the maximum frequency
of the bus. From a protocol standpoint, for this to work,
each chip will receive bus signals at the end of one cycle,
evaluate those signals in a second cycle and decide what to
transmit, then transmit the response in the third cycle.

To maximize implementation frequency, the Runway bus
project took a system-level design approach. All modules on
the bus and the bus itself were designed together for
optimal performance, instead of designing an interface
specification permitting any new module to be plugged in as long
as it conforms to the specification. We achieved a
higher-performance system with the system approach than we
could have achieved with an interface specification.

The I/O driver cells for the different modules were designed
together and SPICE-simulated iteratively to get the best
performance. Since short distances are important, the pinouts
of the modules are coordinated to minimize unnecessary
crossings and to minimize the worst-case bus paths. See
page 22 for more information on the electrical design of the
Runway bus.

The bus can be faster if there are fewer modules on it, since
there is less total length of bus and less capacitance to drive.
The maximum configuration is limited to six modules (four
processors, a dual I/O adapter, and a memory controller) to
achieve the targeted frequency of 120 MHz.

Another optimization made to achieve high bus frequency is
the elimination of wire-ORs. By requiring that only one
module drive a signal in any cycle, some traditionally bused
signals that require fast response, such as cache coherency
check status, are duplicated, one set per module. Other
bused signals that do not require immediate response (e.g.,
error signals) are more cost-effectively transformed into
broadcast transactions. Adapting the Runway protocol to
eliminate wire-ORs allowed us to boost the bus frequency by
10 to 20 percent.

Acknowledgments 

The authors would like to acknowledge the contributions of
many individuals who participated in the definition of the
Runway bus protocol: Robert Brooks, Steve Chalmers,
Barry Flahive, David Fotland, Craig Frink, Hani Hassoun,
Tom Hotchkiss, Larry McMahan, Bob Naas, Helen Nusbaum,
Bob Odineal, John Shelton, Tom Spencer, Brendan Voge,
John Wickeraad, Jim Williams, John Wood, and Mike Ziegler.
In addition, we would like to thank the various engineers
and managers of the Computer Technology Laboratory, the
General Systems Laboratory, the Chelmsford Systems
Laboratory, and the Engineering Systems Laboratory who helped
design, verify, build, and test the processors, I/O adapters,
and memory controllers necessary to make the Runway bus
a reality.

Reference

1. P. Stenstrom, "A Survey of Cache Coherence Schemes for
Multiprocessors," IEEE Computer, Vol. 23, no. 6, June 1990, pp. 12-25.






Design of the HP PA 7200 CPU 



The PA 7200 processor chip is specifically designed to give enhanced 
performance in a four-way multiprocessor system without additional 
interface circuits. It has a new data cache organization, a prefetching 
mechanism, and two integer ALUs for general integer superscalar 
execution. 



by Kenneth K. Chan, Cyrus C. Hay, John R. Keller, Gordon P. Kurpanek, Francis X. Schumacher, and
Jason Zheng



Since 1986, Hewlett-Packard has designed PA-RISC
processors for its technical workstations and servers,
commercial servers, and large multiprocessor transaction processing
machines. The PA 7200 processor chip is an evolution of
the high-performance single-chip superscalar PA 7100 design.

The PA 7200 incorporates a number of enhancements
specifically designed for a glueless four-way multiprocessor system
with increased performance on both technical and
commercial applications. On the chip is a multiprocessor system
bus interface which connects directly to the Runway bus
described in the article on page 18. The PA 7200 also has a
new data cache organization, a prefetching mechanism, and
two integer ALUs for general integer superscalar execution.
The PA 7200 artwork was scaled down from the PA 7100's
0.8-micrometer HP CMOS26B process for fabrication in a
0.55-micrometer HP CMOS14A process.

Fig. 1 shows the PA 7200 in a typical symmetric
multiprocessor system configuration and Fig. 2 is a block diagram of the
PA 7200.

Processor Overview 

The PA 7200 VLSI chip contains all of the circuits for one
processor in a multiprocessor system except for external
cache arrays. This includes integer and floating-point
execution units, a 120-entry fully associative translation lookaside
buffer (TLB) with 16 block translation entries and hardware
TLB miss support, off-chip instruction and data cache
interfaces for up to 2M bytes of off-chip cache, an assist cache,
and a system bus interface. The floating-point unit in the
PA 7200 is the same as that in the PA 7100 and retains the
PA 7100's 2-cycle latency and fully pipelined execution of
single and double-precision add, subtract, multiply, FMPYADD,
and FMPYSUB instructions. The instruction cache interface
and integer unit are enhanced for superscalar execution of
integer instruction pairs. The bus interface and the assist
cache are completely new designs for the PA 7200.

In addition to the performance features, the PA 7200
contains several new architectural features for specialized
applications:

* Little endian data format support on a per-process basis
* Support for uncacheable memory pages
* Increased memory page protection ID (PID) size
* Load/store "spatial locality only" cache hint
* Coherent I/O support.

The CPU is fabricated in Hewlett-Packard's CMOS14A
process with 0.55-micrometer devices and three-level metal
interconnect technology. The processor chip is 1.4 by 1.5 cm
in size, contains 1.3 million transistors, and is packaged in a
540-pin ceramic PGA. IEEE 1149.1 JTAG-compliant
boundary scan protocol is included for chip test and fault
isolation. Fig. 3 is a photomicrograph of the PA 7200 CPU chip.

Instruction Execution 

A key feature of the PA 7100 that is retained in the PA 7200
is an execution pipeline highly balanced for both
high-frequency operation and very few (compared to most current
microprocessors) pipeline stall cycles resulting from data,
control, and fetch dependencies. The only common
pipeline stall penalties are a one-cycle load-use interlock for any
cache hit, a one-cycle penalty for the immediate use of a
floating-point result, a zero-to-one-cycle penalty for a
mispredicted branch, and a one-cycle penalty for store-load
combinations. The PA 7200 improves on the PA 7100
pipeline by removing the penalty for store-store combinations.
This was achieved by careful timing of off-chip SRAMs,
which are cycled at the full processor frequency. Removal of
the store-store penalty is particularly helpful for code that
has bursts of register stores, such as the code typically
found at procedure calls and state saves.

Fig. 1. The PA 7200 processor in a typical symmetric multiprocessor
system configuration.

Fig. 2. Block diagram of the PA 7200 CPU.




The PA 7200 features an integer superscalar implementation
geared to high-frequency operation similar to the PA 7100LC
processor. In a superscalar processor, more than one
instruction can be executed in a single clock cycle. When two
instructions are executed each cycle, this is also referred to
as bundling or dual-issuing. In previous PA 7100 processors,
only a floating-point operation could be paired with an
integer operation. The PA 7200 adds the ability to execute two
integer operations per cycle. This will benefit many
applications that do not have intensive floating-point operations. To
support this integer superscalar capability, the PA 7200 adds
a second integer ALU, two extra read ports and one extra
write port in the general register stack, a new predecoding
block, a new instruction bus, additional register bypassing
circuits, and associated control logic.

Instructions are classified into three groups: integer
operations, loads and stores, and floating-point operations. The
PA 7200 can execute a pair of instructions in a single cycle if
they are from different groups or if they are both from the
integer operation group. Branches are a special case of
integer operations; they can execute with the preceding
instruction but not with the succeeding instruction. Double-word
alignment is not required for instructions executing in the
same cycle. As in the PA 7100, only floating-point operations
can bundle across a cache line or page boundaries. The
PA 7200 can also execute two instructions writing to the
same target register in a single cycle.
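
The pairing rules in this paragraph can be summarized in a few lines of
code. This is a minimal sketch of only the rules stated above; the
group names, the function name, and the `first_is_branch` flag are
hypothetical, and the further restrictions on shift-merge and
test-condition instructions are not modeled.

```python
INTEGER, LOAD_STORE, FLOAT = "integer", "load_store", "float"


def can_bundle(first_group, second_group, first_is_branch=False):
    """Return True if two adjacent instructions may issue in one cycle.

    first_group / second_group: group of the earlier and later
    instruction. first_is_branch: the earlier instruction is a branch
    (branches belong to the integer group).
    """
    # A branch may pair with the instruction before it, but never
    # with the instruction after it.
    if first_is_branch:
        return False
    # Different groups always pair; within one group, only two
    # integer operations may pair.
    return first_group != second_group or first_group == INTEGER


print(can_bundle(INTEGER, INTEGER))        # two integer operations
print(can_bundle(LOAD_STORE, LOAD_STORE))  # two loads or stores
print(can_bundle(FLOAT, LOAD_STORE))       # different groups
```

A branch as the second instruction of a pair is allowed by this check,
since it then executes with its preceding instruction.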

The PA 7200 contains three instruction buses that connect
the instruction cache interface to two integer ALUs and a
floating-point unit. As in the PA 7100, an on-chip
double-word instruction buffer assists the bundling of two
instructions that may not be double-word aligned. On every cycle,
one or two instructions can come from any of four sources
(even or odd instructions from the cache, or even or odd
instructions from the on-chip buffer) and can go to any of
the three destination buses.



Fig. 3. Photomicrograph of the PA 7200 CPU chip.






The process by which multiple instructions are dispatched
to different instruction buses leading to corresponding
execution units is called steering. The PA 7200 has a very
aggressive timing budget for steering and instruction
decoding (done in less than one processor cycle); therefore, the
steering logic must be fast. In addition, on every cycle, the
control logic needs to track which one or two of the three
instruction buses contain valid instructions as well as the
order of concurrently issued instructions. To avoid having
superscalar steering and execution decode logic degrade the
CPU frequency, six predecode bits are allocated in the
instruction cache for each double word. Data dependencies
and resource conflicts are checked and encoded in
predecode bits as instructions are moved from memory into the
cache, when timing is more relaxed. These six predecode
bits are carefully designed so that they are optimal for both
the steering circuits and the control logic for proper
pipelined execution. Thanks to the optimized design and
implementation of these predecode bits and the associated
steering circuits and control logic, this path is not a speed-limiting
path for the PA 7200 chip and does not obstruct its
high-frequency operation.

To minimize area, shift-merge and test condition units are
not duplicated in the second ALU. Thus shifts, extracts,
deposits, and instructions using the test condition block are
limited to one per cycle. Also, instructions with test
conditions cannot be bundled with integer operations or loads or
stores as their successors. A modern compiler can minimize
the effect of these few superscalar restrictions through code
scheduling, thereby allowing the processor to exploit much
of the instruction-level parallelism available in application
code to achieve a low average CPI (cycles per instruction).

Data Cache Organization

Fig. 4 shows the PA 7200's data cache organization. The chip
contains an interface to up to 1M bytes of off-chip direct
mapped data cache consisting of industry-standard SRAMs.
The off-chip cache is cycled at the full processor frequency
and has a one-cycle latency.

The chip also includes a small fully associative on-chip assist
cache. Two pipeline stages are associated with address
generation, translation, and cache access for both caches, which
results in a maximum of a one-cycle load-use penalty for a
hit in either cache. The on-chip assist cache combined with
the off-chip cache together form a level-1 cache. Because
this level-1 cache is accessed in one processor cycle and
supports a large cache size, no level-2 cache is supported.
The ability to access the large off-chip cache with low latency
greatly reduces the CPI component associated with
cache-resident memory references. This is particularly helpful for
code with large working data sets.

The on-chip assist cache consists of 64 fully associative
32-byte cache lines. A content-addressable memory (CAM)
is used to match a translated real line address with each
entry's tag. For each cache access, 65 entries are checked
for a valid match: 64 assist cache entries and one off-chip
cache entry. If either cache hits, the data is returned directly
to the appropriate functional unit with the same latency.
Aggressive self-timed logic is employed to achieve the
timing requirements of the assist cache lookup.

Fig. 4. PA 7200 data cache organization.

Assist Cache Features:

* 2K-Byte Fully Associative Organization
* Single-Cycle 32-Byte Cache Line Write
* Single-Cycle Cache Line Read

Main Cache Features:

* 4K-Byte to 1M-Byte Size
* Single-Cycle Loads
* Pipelined Single-Cycle Stores
* Hashed Virtual Indexing

Lines requested from memory as a result of either cache
misses or prefetches are initially moved to the assist cache.
Lines are moved out of the assist cache in first-in, first-out
order. Moving lines into the assist cache before moving
them into the off-chip cache eliminates the thrashing
behavior typically associated with direct mapped caches. For
example, in the vector calculation:

    for i := 0 to N do
        A[i] := B[i] + C[i] + D[i]

if elements A[i], B[i], C[i], and D[i] map to the same cache index,
then a direct mapped cache alone would thrash on each
element of the calculation. This would result in 32 cache
misses for eight iterations of this loop. With an assist cache,
however, each line is moved into the cache system without
displacing the others. Assuming sequential 32-bit data
elements, eight iterations of the loop causes only the initial
four cache misses.
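
The miss counts quoted above can be reproduced with a small model. The
Python below is a simplified sketch, not a description of the actual
hardware: the 1024-line main cache size, the placement of the arrays
exactly one cache size apart, and the access order (B, C, D, then A)
are assumptions chosen so that all four arrays collide on one index.

```python
from collections import OrderedDict

LINE_BYTES = 32      # cache line size given in the text
ELEMENT_BYTES = 4    # "sequential 32-bit data elements"
MAIN_LINES = 1024    # hypothetical direct mapped cache of 1024 lines


def count_misses(addresses, assist_entries):
    """Count misses for a direct mapped cache with an optional
    FIFO assist cache placed in front of it."""
    main = {}                # cache index -> line number held there
    assist = OrderedDict()   # line numbers in arrival (FIFO) order
    misses = 0
    for address in addresses:
        line = address // LINE_BYTES
        if main.get(line % MAIN_LINES) == line or line in assist:
            continue         # hit in either cache
        misses += 1
        if assist_entries == 0:
            main[line % MAIN_LINES] = line   # no assist cache
            continue
        assist[line] = None  # new lines enter the assist cache first
        if len(assist) > assist_entries:
            oldest, _ = assist.popitem(last=False)
            main[oldest % MAIN_LINES] = oldest  # FIFO move to main cache
    return misses


# Place A, B, C, and D one cache size apart: same index for all four.
cache_bytes = MAIN_LINES * LINE_BYTES
base = {"A": 0, "B": cache_bytes, "C": 2 * cache_bytes, "D": 3 * cache_bytes}
trace = [base[name] + ELEMENT_BYTES * i
         for i in range(8)                    # eight loop iterations
         for name in ("B", "C", "D", "A")]    # three reads, one write

print(count_misses(trace, assist_entries=0))   # direct mapped cache alone
print(count_misses(trace, assist_entries=64))  # with a 64-entry assist cache
```

Under these assumptions the model reports 32 misses for the direct
mapped cache alone and 4 with the assist cache, matching the figures in
the text.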

Larger caches do not reduce this type of cache thrashing.
While modern compilers are often able to realign data
structures to reduce or eliminate thrashing, sufficient compile
time information is not always available in an application to
make the correct optimization possible. The PA 7200's assist
cache eliminates cache thrashing extremely well with
minimal hardware and without compiler optimizations.

Lines that are moved out of the assist cache can
conditionally bypass the off-chip cache and move directly back to
memory. A newly defined spatial locality only hint can be
specified in load and store instructions to indicate that data
exhibits spatial locality but not temporal locality. A data line
fetched from memory for an instruction containing the
spatial locality hint is moved into the assist cache like all other
lines. Upon replacement, however, the line is flushed back
to memory instead of being moved to the off-chip cache.
This mechanism allows large amounts of data to be
processed without polluting the off-chip cache. Additionally,
cycles are saved by avoiding one or two movements of the
cache line across the 64-bit interface to the off-chip cache.
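
The replacement behavior described here can be sketched as a FIFO whose
evicted lines go to one of two destinations. The class and method names
below are hypothetical, and the model ignores hits, dirty state, and
timing; it only shows where a line goes when it leaves the assist
cache.

```python
from collections import deque


class AssistCache:
    """FIFO assist cache that records the spatial locality only hint."""

    def __init__(self, entries=64):
        self.entries = entries
        self.fifo = deque()   # (line_address, sl_hint), oldest first

    def insert(self, line, sl_hint=False):
        """Add a line. Return (evicted_line, destination), or None
        if nothing had to be evicted."""
        self.fifo.append((line, sl_hint))
        if len(self.fifo) <= self.entries:
            return None
        old_line, old_hint = self.fifo.popleft()
        # Hinted lines bypass the off-chip cache on replacement.
        destination = "memory" if old_hint else "off-chip cache"
        return (old_line, destination)


cache = AssistCache(entries=2)      # tiny size, for illustration only
cache.insert(0x100, sl_hint=True)   # use-once data
cache.insert(0x120)
print(cache.insert(0x140))          # evicts 0x100 back to memory
print(cache.insert(0x160))          # evicts 0x120 to the off-chip cache
```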

The assist cache allows prefetches to be moved into the
cache system in a single cycle. Prefetch returns are
accumulated independently of pipeline execution. When the
complete line is available, one data cache cycle is used to insert
the line into the on-chip assist cache. If an instruction that is
not using the cache is executing, no pipeline stalls are
incurred.

Because the assist cache is accessed using a translated
physical address, it adds an inherently critical speed path to
the chip microarchitecture. An assist cache access consists
of virtual cache address generation, translation lookaside
buffer (TLB) lookup to translate the virtual address into a
physical address, and finally the assist cache lookup. The
TLB lookup and assist cache lookup need to be completed
in one processor cycle or 8.3 ns for 120-MHz operation. To
meet the speed requirements of this path a combination of
dynamic and self-timed circuit techniques is used.



The TLB and assist cache are composed of
content-addressable memory (CAM) structures, which differ from more
typical random-access memory (RAM) structures in that they
are accessed with data, which is matched with data stored
in the memory, rather than by an index or address. A typical
RAM structure can be broken into two halves: an address
decoder and a memory array. The input address is decoded
to determine which memory element to access. Similarly, a
CAM has two parts: a match portion and a memory array. In
the case of the assist cache, the match portion consists of
27-bit comparators that compare the stored cache line tag
with the translated physical address of the load or store
instruction. When a match is detected by one of the
comparators, then that comparator dumps the associated cache line
data.
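
The difference between the two access styles can be shown in software
terms. This is only an analogy, with hypothetical function names: a
real CAM performs all of its tag comparisons in parallel in a single
access, whereas the loop below visits entries one at a time.

```python
TAG_BITS = 27                     # comparator width given in the text
TAG_MASK = (1 << TAG_BITS) - 1


def ram_read(array, address):
    """RAM: the address is decoded to select exactly one element."""
    return array[address]


def cam_read(entries, key):
    """CAM: the key is compared against every stored tag; the entry
    whose tag matches supplies the data. Returns None on a miss."""
    for tag, data in entries:
        if (tag & TAG_MASK) == (key & TAG_MASK):
            return data
    return None


ram = ["line0", "line1", "line2"]
cam = [(0x12345, "line A"), (0x0ABCD, "line B")]
print(ram_read(ram, 1))        # selected by position
print(cam_read(cam, 0x0ABCD))  # selected by matching content
print(cam_read(cam, 0x00001))  # no tag matches
```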

Fig. 5 shows the timing of an access to the TLB and assist
cache. This single 8.3-ns clock cycle path is broken into
multiple subsections using self-timed circuits. An access begins
when the single-ended virtual address is latched and
converted to complementary predischarged values VADDR and
its complement in the TLB address buffer on the rising edge of CK.
These dual-rail signals are then used to access the CAM
array. A dummy CAM array access, representing the
worst-case timing through the CAM array, is used to initiate the
TLB RAM access. If any of the CAM entries matches the
VADDR, then the completion of the dummy CAM access,
represented by TLB READ_CK, enables the TLB read control
circuits to drive one of the TLB RAM read lines. The
precharged RAM array is then read and a differential
predischarged physical address is driven to the assist cache.
A similar access is then made to the assist cache CAM and
RAM structures to produce data on the rising edge of CK.
A precharged load aligner is used to select the appropriate
part of the 256-bit cache line to drive onto the data bus and
to perform byte swapping for big-to-little-endian data format
conversion. Although this path contains tight timing budgets,
careful circuit design and physical layout ensure that it does
not limit the processor frequency.

The basic structure of the external cache remains unchanged
from the PA 7100 CPU. Separate instruction (I) and data (D)
caches are employed, each connected to the CPU by a 64-bit
bidirectional bus. The cache is virtually indexed and
physically tagged to minimize access latency. The I-cache data
and tag are addressed over a common address bus, IADH.
The D-cache data has a separate address bus, DADH, and the
D-cache tag has a separate address bus, TADH. Used in
conjunction with an internal store buffer for write data, the split
D-cache address allows higher-bandwidth stores to the
D-cache. Instead of a serial read-modify-write, stores can be
pipelined so that TADH can be employed for the tag read of a
new store instruction while DADH is used to write the data
from the previous store instruction.

As in the PA 7100 CPU, the PA 7200 CPU cache interface is
tuned to work with asynchronous SRAMs by creating special
clock signals for optimal read and write timing. The cache is
read with a special latch edge that allows wave pipelining,
that is, a second read is launched before the first read is
actually completed. The cache is written using two special
clocks that manipulate the write enable and output enable
SRAM controls for a minimum total write cycle time.






Fig. 5. PA 7200 TLB and assist cache timing.

The design team worked closely with several key SRAM
vendors to develop a specification for a 6-ns SRAM with
enhanced write speed capabilities. These new SRAMs allow
both of the caches to operate at the CPU clock frequency.
The CPU can be shipped with equal-sized instruction and
data caches of up to 1M bytes each. As in the PA 7100 CPU,
a read can be finished in one clock cycle. However, to match
the bandwidth of the Runway bus and to increase the
performance of store-intensive applications, a significant timing
change was made to improve the bandwidth for writes to
the cache. The PA 7200 CPU achieves a quasi-single-cycle
write: a series of N writes requires N+1 cycles. The
one-cycle overhead is required for turning the bus around from
read to write, that is, one cycle is required to turn off the
SRAM drivers and allow the CPU drivers to take over. No
penalty is incurred in transitioning from write to read.
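
The effect of the one-cycle turnaround on write throughput is easy to
tabulate. The helper names below are hypothetical; the sketch simply
applies the N+1 rule from the text to a burst of back-to-back writes
that follows a read.

```python
def write_burst_cycles(n_writes: int) -> int:
    """Cycles for a series of N writes: N data cycles plus one
    read-to-write bus turnaround cycle."""
    if n_writes <= 0:
        return 0
    return n_writes + 1


def cycles_per_write(n_writes: int) -> float:
    """Average cost per write; approaches 1.0 as bursts get longer."""
    return write_burst_cycles(n_writes) / n_writes


for n in (1, 4, 16):
    print(n, write_burst_cycles(n), cycles_per_write(n))
```

The longer the burst of stores, the closer the average cost comes to a
true single-cycle write.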

Prefetching Mechanisms 

A significant amount of execution time is spent waiting for
data or instructions to be returned from memory. In an
HP 9000 K-class system running transaction processing
applications, an average of about one cycle per instruction can
be attributed to the processor waiting for memory. The total
CPI for such an application is about 2. Execution time can
therefore be greatly reduced by reducing the number of
cycles the processor spends waiting for memory. The
PA 7200 incorporates hardware and software prefetching
mechanisms, which initiate memory requests before the
data or instructions are used.
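
To see why this matters, the CPI figures above can be plugged into a
simple model. The function and the `fraction_hidden` parameter are
assumptions introduced for this example; the text gives only the total
CPI (about 2) and the memory-wait component (about 1), not how much of
the waiting prefetching actually removes.

```python
def speedup_from_prefetching(total_cpi: float, memory_cpi: float,
                             fraction_hidden: float) -> float:
    """Speedup when a given fraction of memory-wait cycles is hidden.

    total_cpi:       cycles per instruction including memory waits
    memory_cpi:      portion of total_cpi spent waiting for memory
    fraction_hidden: hypothetical share of that waiting removed (0..1)
    """
    new_cpi = total_cpi - memory_cpi * fraction_hidden
    return total_cpi / new_cpi


# Total CPI of about 2, of which about 1 is memory waiting.
for hidden in (0.0, 0.25, 0.5, 1.0):
    print(hidden, round(speedup_from_prefetching(2.0, 1.0, hidden), 3))
```

Hiding half of the memory waiting in this model would cut the CPI from
2 to 1.5, a speedup of about 1.33.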

Instruction Prefetching. The PA 7200 implements an efficient
instruction prefetch algorithm. Instruction fetch requests
are issued speculatively ahead of the instruction execution
stream. Multiple instruction prefetch requests can be in
flight to the memory system simultaneously. Issuing multiple
prefetches ahead of the execution stream works well when
linear code segments are initially encountered. This
instruction prefetching scheme yields a \M performance speedup
on transaction processing benchmarks.






Data Prefetching. The PA-RISC instruction set includes a
class of instructions that modify the base value in a general
register by an immediate displacement or general register
index value. An example is LDWX,m r1(r2),r3. The LDWX (load
word indexed) instruction with a modify completer (,m)
loads the value at the address contained in register r2 into
register r3, and then adds r1 to r2 (i.e., load r2 -> r3; r1 + r2 ->
r2). The PA 7200 can use this class of instructions to
speculate what data may soon be accessed by the code stream. If
the load r2 in the above example is a cache miss, a prefetch is
issued to the address calculated by the base register
modification (r1 + r2). The PA 7200 uses this base register
modification to speculate where a future data reference will occur.
For example, if r1 contains line 0x40 and r2 contains line
0x100 and no lines are initially in the cache, then this
instruction initiates a request for line 0x100 in response to the
cache miss and line 0x140 is prefetched. If the line 0x140 is
later used, some or all of the cache miss penalty is avoided.

When a line is prefetched, it is moved into the assist cache
and tagged as being a prefetched line. When a prefetched
line is later referenced by the code stream, another prefetch
is launched. Continuing with the above example, if this load
instruction were contained in a loop, on the first iteration of
the loop lines 0x100 and 0x140 would be requested from
memory. On the second iteration line 0x140 is referenced.
The assist cache detects this as the first reference to a
prefetched line and initiates a prefetch of line 0x180. This
allows memory requests to stay ahead of the reference
stream, reducing the stall cycles associated with memory
latency.
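
The loop example can be traced with a small model of the prefetch
rules. The sketch below is an idealization, not the actual hardware
algorithm: it assumes prefetched lines arrive before they are needed,
ignores the limit on outstanding prefetches, and uses hypothetical
names throughout.

```python
LINE_BYTES = 32


def line_of(address: int) -> int:
    """Address of the 32-byte line containing the given address."""
    return address - (address % LINE_BYTES)


def run_loop(r1: int, r2: int, iterations: int):
    """Trace misses and prefetches for a loop of LDWX,m r1(r2),r3."""
    cache = {}   # line address -> True while tagged "prefetched"
    log = []
    for _ in range(iterations):
        line = line_of(r2)
        target = line_of(r2 + r1)   # base register modification
        if line not in cache:
            # Cache miss: request the line and prefetch the next one.
            cache[line] = False
            log.append(("miss", hex(line)))
            launch = True
        else:
            # Only the first reference to a prefetched line triggers
            # another prefetch.
            launch = cache[line]
            cache[line] = False
        if launch and target not in cache:
            cache[target] = True
            log.append(("prefetch", hex(target)))
        r2 += r1                     # the ,m completer updates r2
    return log


for event in run_loop(r1=0x40, r2=0x100, iterations=3):
    print(event)
```

With r1 = 0x40 and r2 = 0x100, the trace shows one miss on line 0x100
followed by prefetches of lines 0x140, 0x180, and 0x1c0, so only the
first iteration misses in this idealized model.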

The PA 7200 allows four data prefetch requests to be
outstanding at one time. These prefetches can be used for
either prefetches along multiple data reference streams or
farther ahead on one data reference stream. Returning to
the vector example,

    for i := 0 to N do
        A[i] := B[i] + C[i] + D[i]

each new cache line entered will cause four new prefetch
requests to be issued: one for each vector. On the other
hand, if the processor were doing a block copy:

    for i := 0 to N do
        A[i] := B[i]

then it could prefetch two lines ahead of each reference
stream.

Reducing Average Memory Access Time 

A number of features have been combined in the PA 7200 to
minimize the average memory access time (the average
number of cycles used for a memory reference). These
features together provide excellent performance speedups
on a number of applications that stress the memory
hierarchy. Fig. 6 compares the performance of the PA 7200 and the
PA 7100 on a number of technical benchmarks. To minimize
the average memory access time associated with cache hits,
the large low-latency off-chip cache from the PA 7100 design
has been retained and enhancements made to allow
single-cycle stores. The PA 7200 improves on the PA 7100 by
reducing cache misses by minimizing compulsory, capacity, and
conflict cache misses.




Fig. 6. A number of features that minimize the average memory
access time allow the PA 7200 CPU to outperform its predecessor
the PA 7100 on technical benchmarks.

The PA 7200 reduces conflict misses by adding effective
associativity to entries of the main cache. This is done
without the overhead required for a large multiset associative
cache. Traditionally caches have been characterized as
direct mapped, multiset associative, or fully associative. The
PA 7200 assist cache effectively adds dynamically adjusted
associativity to main cache entries. As miss lines are
brought into the assist cache, the entries with the same
cache index mapping in the main cache are not immediately
replaced. This allows multiple cache lines with the same
index to reside in "the cache" at the same time. All assist
cache entries can be filled with lines that map to the same
off-chip cache index, or they can be filled with entries that
map to various indexes. This eliminates the disastrous
thrashing that can occur with a direct mapped cache, as
discussed earlier.

The PA 7200 reduces compulsory cache misses by
prefetching lines that are likely to be used. When the software has
the information necessary at compile time to anticipate what
data is needed, the base register modification class of load
and store instructions can be used to direct prefetching. If
no specific direction is added to code or if old code is being
run, then base register modifying loads and stores can still
be used by the hardware to do effective prefetching. The
processor can also be configured to use loads and stores
that do not modify base registers to initiate speculative
requests. Because memory bandwidth is limited, care was
taken to minimize the amount of bad prefetching while
maximizing the speedup realized by issuing memory requests
speculatively. Both old code traces and new compiler
optimizations were investigated to determine the best set of
prefetching rules.

Beyond supporting large caches, the PA 7200 reduces capacity misses
by selectively allocating lines to the off-chip cache only if they
benefit from being moved there. More effective use can be made of a
given cache capacity by only moving data that exhibits temporal
locality to the off-chip cache. The assist cache provides an
excellent location for use-once data. The spatial locality only
(SL) hint associated with load and store instructions allows code
to identify which data is use-once (or simply too large to be
effectively cached), thereby reducing capacity misses. The SL hint
is encoded in previously reserved load and store instruction
fields. Large analytic applications and block move and clear
routines achieve excellent speedups from this new cache hint.

Bus Interface 

The PA 7200's Runway bus interface is carefully tuned to the
requirements and capabilities of the processor core. The interface
has several features that minimize transaction latency, reduce
processor cost, and take advantage of particular attributes of the
CPU core to simplify interface design. The bus interface contains a
cache coherence queue and transaction buffers, arbitration logic,
and logic to support multiple processor-to-bus-frequency ratios.
The bus interface also implements an efficient double snoop†
algorithm for coherent transaction management.

The PA 7200 connects directly to the Runway bus without
transceivers or interface chips. Without this layer of external
logic, system cost is reduced while performance is increased
because of lower CPU-to-bus latency. Special system and circuit
designs allow the Runway bus to run at a frequency of 120 MHz while
maintaining connectivity to six loads. Negative-hold-time receiver
design and tight skew control prevent races when drivers and
receivers operate from the same clock edge. A read transaction is
issued in one bus cycle and the 32-byte memory return is
transferred in four cycles, resulting in a peak sustainable
bandwidth of 768 megabytes per second. To take advantage of the
high bus bandwidth, the PA 7200 can have up to six memory reads in
flight at the same time.
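The peak sustainable figure can be checked with a little
arithmetic. The sketch below assumes a 32-byte memory return (the
size consistent with the stated bandwidth) and assumes that the
one-cycle read request and the four-cycle return occupy the same
bus back to back, so that each line costs five bus cycles. The
function name is illustrative only.

```python
BUS_HZ = 120_000_000   # Runway bus frequency
LINE_BYTES = 32        # assumed size of one memory return
REQUEST_CYCLES = 1     # a read transaction is issued in one bus cycle
RETURN_CYCLES = 4      # the memory return takes four bus cycles


def sustainable_bandwidth(bus_hz, line_bytes, request_cycles, return_cycles):
    """Bytes per second when every line costs request + return cycles."""
    cycles_per_line = request_cycles + return_cycles
    return bus_hz * line_bytes // cycles_per_line


bandwidth = sustainable_bandwidth(BUS_HZ, LINE_BYTES,
                                  REQUEST_CYCLES, RETURN_CYCLES)
print(bandwidth // 1_000_000, "megabytes per second")
```

Under these assumptions the result is 768 megabytes per second,
which is 80% of the 960 megabytes per second that four data cycles
out of four would deliver.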

To minimize read transaction latency, the PA 7200 asserts and
captures arbitration signals on the half cycle (phase), as shown in
Fig. 7. The processor core communicates its intent to initiate a
transaction in the first phase, allowing the interface to assert
its bus arbitration signal on the second phase.

† A snoop, also known as a cache coherency check, is the action
performed by all processors and I/O adapters when they observe a
coherent transaction issued by another module. Each module
performing the snoop must check its cache for the address of the
current transaction, and if found, change the state of that cache
address. Cache state transitions are described in the article on
page 18.



The transaction address information, only available on the third
phase, is then forwarded from the processor core to the bus
interface. In the common case where there is no contention for the
Runway bus, the address is driven onto the bus in the next cycle.
Read and write buffers, included in the bus interface to decouple
the CPU core in case arbitration is not immediately won, are
bypassed in the common case to reduce latency.

Transactions from the read and write buffers are issued by the bus
interface with fixed priorities. Snoop data has the highest
priority, followed by read requests, then the write of cache
victims. When the memory controller cannot handle new read
requests and the read and write buffers are full, the bus interface
will issue the write transaction before the read to make best use
of the bus bandwidth available.
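The issue order described above can be sketched as a small
selection function. This is only an illustration of the stated
rule: the names Kind, next_transaction, memory_accepts_reads, and
buffers_full are hypothetical, and the real interface logic is
certainly more involved.

```python
from enum import IntEnum


class Kind(IntEnum):
    """Transaction kinds; a lower value means a higher fixed priority."""
    SNOOP_DATA = 0
    READ = 1
    VICTIM_WRITE = 2


def next_transaction(pending, memory_accepts_reads=True, buffers_full=False):
    """Pick the kind of transaction to issue next, or None if idle.

    pending is a set of Kind values that have something queued.
    """
    if Kind.SNOOP_DATA in pending:
        return Kind.SNOOP_DATA
    # Exception from the text: when reads cannot be accepted and the
    # buffers are full, a queued victim write goes ahead of the read.
    if (not memory_accepts_reads and buffers_full
            and Kind.VICTIM_WRITE in pending):
        return Kind.VICTIM_WRITE
    return min(pending, default=None)


print(next_transaction({Kind.READ, Kind.VICTIM_WRITE}))
print(next_transaction({Kind.READ, Kind.VICTIM_WRITE},
                       memory_accepts_reads=False, buffers_full=True))
```

The first call selects the read, following the fixed priorities;
the second selects the victim write, following the exception.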

Since transactions on the Runway bus are always accepted (and never
rejected or retried at the expense of bus bandwidth), each
processor acting as a third party must be able to accept a burst of
coherent transactions. Since there are times when the CPU core is
busy and cannot accept a snoop, the bus interface implements a
ten-transaction-deep queue for cache snoops and a
three-transaction-deep queue for TLB snoops. With deep coherency
queues, a large number of coherent transactions from several
processors can be outstanding without the need to invoke flow
control.

Processor-to-bus frequency ratios of 1:1, 3:2, and 4:3 are provided
for higher-frequency processor upgrades. Using a ratio algorithm
that requires the bus clock to be synchronous with the processor
clock ensures that the ratio logic does not impose synchronization
delays typical of systems with asynchronous clock domains. For any
ratio, the worst-case delay is less than one CPU clock cycle, and
in the best case, data transmission does not incur any delay.

[Figure 7: timing diagram over Phases 1 to 6 showing CPU core, bus
interface, and Runway bus (ARB_OUT, ADDR_DATA) activity for a read
transaction.]

Fig. 7. With numerous bypass paths, latency between the CPU core
and the Runway bus is minimized. As soon as the CPU detects a cache
miss, the bus interface arbitrates for the system bus in half a
cycle. As soon as the cache miss address is available, it is routed
to the interface in half a cycle, where its bus parity is generated
in another half-cycle. Performance is maximized in the common case
of little bus traffic, when the CPU wins bus arbitration
immediately.

To minimize processor pipeline stalls resulting from multiprocessor
interference, transactions at the head of the coherency queue are
forwarded to the CPU core in two steps. First, the core is sent a
lightweight query, which steals one cycle of cache bandwidth. A
low-latency response is received from the off-chip and assist
caches. Only when a cache state modification is required is a
second full-service query forwarded to the CPU core. Since the vast
majority of cache snoops result in misses, this double snoop
approach allows the PA 7200 to achieve higher multiprocessor
performance without the added cost and complexity of a dual-ported
cache or duplicate cache tags.14
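The benefit of the two-step snoop can be estimated with a simple
cost model. The sketch below is an illustration only. The text
gives the cost of the lightweight query (one cycle of cache
bandwidth) but not the cost of the full-service query, so
full_query_cycles is a hypothetical parameter, as are the function
names and the 5% snoop hit rate used in the example.

```python
def double_snoop_cycles(snoops, full_query_cycles=4):
    """Cache cycles taken when every snoop first gets a one-cycle
    lightweight query and only state-changing snoops get a full query.

    snoops is an iterable of booleans: True when the snoop requires a
    cache state modification.
    """
    cycles = 0
    for needs_state_change in snoops:
        cycles += 1                       # lightweight query
        if needs_state_change:
            cycles += full_query_cycles   # second, full-service query
    return cycles


def single_snoop_cycles(snoops, full_query_cycles=4):
    """Cache cycles taken if every snoop were a full-service query."""
    return sum(full_query_cycles for _ in snoops)


# Hypothetical workload: 100 snoops, 5 of which need a state change.
workload = [True] * 5 + [False] * 95
print("double snoop:", double_snoop_cycles(workload))
print("single snoop:", single_snoop_cycles(workload))
```

With these assumed numbers the two-step approach takes 120 cache
cycles instead of 400. The advantage shrinks as the fraction of
snoops that hit grows, which is why the approach relies on most
snoops missing.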

PA 7200 Circuit Translation 

Most of the PA 7200 circuit designs, artwork, and physical design
methodology are based upon and leveraged from the PA 7100 CPU,
which was designed using HP's CMOS26 IC process, tools, and
libraries. However, aggressive performance and cost goals required
that the PA 7200 be fabricated using the faster, denser CMOS14 IC
process also under development. To completely redesign and lay out
existing PA 7100 circuits for the CMOS14 process would have been an
inefficient use of resources and would have greatly extended the
design phase. Therefore, the entire PA 7200 was designed using the
existing CMOS26 technology, and the artwork was then automatically
translated to and reverified in the CMOS14 process.

Unfortunately, automatic translation faced two global issues.
First, CMOS26 is a 5.0V (nominal) process but CMOS14 was originally
specified for 4.0V operation. Simulations showed that the speed of
a few common circuit topologies did not scale linearly into the
target technology because of the lower supply voltage. Detailed
investigation by the CMOS14 development group concluded that
raising the supply voltage by 10% was feasible and the process was
fully qualified for operation at 4.4V. This was sufficient for
these circuits to meet the speed improvement goal.

Secondly, CMOS26 layout rules do not scale uniformly into the
respective rules for CMOS14, since each component of a process
technology has different physical and manufacturing constraints. A
simple gate-shrink algorithm, which only reduces FET effective gate
length, could have provided a 20% transistor speed improvement.
Without overall area reduction, the extra PA 7200 functionality
dictates a die size much larger than the PA 7100 and this approach
would result in slower wire speeds and a sharp increase in
manufacturing cost. With aggressive scaling, a more complex
translation algorithm, and a limited number of engineering
adjustments to the layout and electrical rules, the CMOS14 version
achieves a 2ffK3 overall speed improvement along with a ^18%
power reduction from the original CMOS26 design.

Translation Methodology. The methodology that was developed
accommodates CMOS26 designs and translated CMOS14 artwork in
parallel, is generally transparent, and merges smoothly with the
existing design environment. A hierarchical (block-level)
translation methodology was chosen because it provides many
advantages over the more traditional flat (mask-level) translation.
Important reasons for selecting this approach were:

• Algorithm flexibility. The optimal translation algorithm is not
required to guarantee that every pathological CMOS26 layout, and
more important, all existing PA 7100 blocks are translated to a
legal CMOS14 layout as long as a manageable number of violations
result and are easily correctable by hand. Hierarchical methods
imply editing only unique instances of a violation at the block
level, rather than the entire set on a flattened mask.



• Design modularity. Having parallel hierarchies containing both
CMOS26 and CMOS14 blocks enables additional flexibility.
Translated artwork can be read directly by the front-end editors
for electrical simulation and other purposes. On the top-level
routing blocks, CMOS14 layouts using a tighter metal pitch were a
necessary alternative to the translated CMOS26 versions.

• Concurrent methodology. Translated artwork is available for mask
generation along with the original block. Flat translation is
serialized and for complex algorithms implies a costly delay after
each design release. Moreover, having a complete, hierarchical
CMOS14 artwork database allowed subsequent chip revisions to be
released using incremental changes made directly to the CMOS14
artwork.

Many operations in the translation algorithm are complicated by
hierarchical junctions (these would disappear with a flat
translation). A hierarchical junction is any connection between
objects in separate blocks. If individual artwork features
touching or extending beyond hierarchical boundaries are further
shrunk by a fixed distance after being reduced by the scaling
coefficient, gaps will occur at the parent junctions that cannot
always be filled automatically. A subtle but more troublesome
scaling problem is caused by snapping the location of child
instances to the grid resolution, which creates shape misalignments
or gaps at parent-child or child-child junctions if origins round
in a different direction. This effect can be cumulative, and
becomes significant for junctions that span multiple hierarchical
levels. Increased database size and consistency checking are other
drawbacks of a block-oriented translation.
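The grid-snapping problem can be reproduced with a few numbers. The
Python sketch below is a hypothetical illustration, not the
translation tool: the coordinates, the 0.8 scaling coefficient, the
0.05-µm grid, and the round-to-nearest snapping rule are all
assumptions. Two shapes in different child blocks abut in the
original layout; after hierarchical translation their origins round
in different directions and a gap opens, while a flat translation
of the same edges keeps them together.

```python
GRID_NM = 50   # assumed 0.05-um grid resolution, in nanometers


def scale(x_nm, num=4, den=5):
    """Scale by 0.8 with integer arithmetic (inputs divide evenly)."""
    return x_nm * num // den


def snap(x_nm, grid=GRID_NM):
    """Round a coordinate to the nearest grid point."""
    return (x_nm + grid // 2) // grid * grid


def translate_hierarchical(origin_nm, local_edge_nm):
    """Origin and local edge are scaled and snapped separately."""
    return snap(scale(origin_nm)) + snap(scale(local_edge_nm))


def translate_flat(origin_nm, local_edge_nm):
    """The absolute edge position is scaled and snapped once."""
    return snap(scale(origin_nm + local_edge_nm))


# Child A sits at x = 150 and its shape ends at local x = 900.
# Child B sits at x = 1050 and its shape starts at local x = 0.
# Both edges are at absolute x = 1050, so the shapes abut.
a_right_hier = translate_hierarchical(150, 900)
b_left_hier = translate_hierarchical(1050, 0)
a_right_flat = translate_flat(150, 900)
b_left_flat = translate_flat(1050, 0)

print("hierarchical gap (nm):", b_left_hier - a_right_hier)
print("flat gap (nm):        ", b_left_flat - a_right_flat)
```

In this example the hierarchical translation leaves a gap of one
grid step at the junction, and the flat translation leaves none.
With more levels of hierarchy, each level adds its own rounding,
which is how the effect becomes cumulative.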

A final check was added after CMOS14 layout verification to
hierarchically compare ports, signals, and connectivity between the
CMOS26 and CMOS14 artwork netlists. This was necessary since hand
corrections made to the translated CMOS14 layout could introduce
new design errors.

Translation Algorithm. Any scaling coefficient should ensure that
all minimum widths, spaces, and exact-size shapes from CMOS26 be
translated to CMOS14 such that each edge pair snaps to the grid
resolution (0.05 µm) in the same direction. There are several
natural solutions to ensure that 1.0-µm (drawn) minimum features in
CMOS26 always become 0.6-µm minimum features in CMOS14. For
example:

• Scale by α = 0.8 and then further shrink interconnect by 0.2 µm.

• First shrink interconnect by 0.2 µm and then scale by α = 0.75.

The second option is only practical for library blocks since it is
too aggressive for interconnect with minimum contacted pitch and
provides less margin for the effects of uneven grid snapping. The
detailed algorithm is based upon the first option, with additional
manipulations of n-well regions, FET gate extensions, contact
sizes, interconnect contact enclosure, and interlayer contact
spacing. These operations have parasitic effects which can create
notches and narrow corners and are usually correctable by
automatically filling new width and spacing violations.
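The two candidate transforms can be compared numerically. The
sketch below is a minimal illustration with hypothetical function
names; widths are in nanometers so the arithmetic stays exact. It
assumes a drawn minimum feature of 1.0 µm, which both options map
to 0.6 µm, and shows that for anything wider the second option
shrinks the feature more.

```python
def option_1(width_nm):
    """Scale by 0.8, then shrink by 0.2 um (200 nm)."""
    return width_nm * 4 // 5 - 200


def option_2(width_nm):
    """Shrink by 0.2 um (200 nm), then scale by 0.75."""
    return (width_nm - 200) * 3 // 4


for width in (1000, 2000, 3000):
    print(width, "->", option_1(width), "vs", option_2(width))
```

Both options agree at the 1.0-µm minimum, but a 2.0-µm feature
becomes 1.4 µm under the first option and 1.35 µm under the second.
The difference between the two grows by 0.05 µm for every
additional micrometer of drawn width, which is consistent with the
second option being the more aggressive one.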






There were still a residual number of geometrical cases that could
not be fully translated by any reasonable tool or heuristic. In
these cases we either waived the layout rules where margin was
available or made extra efforts to repair rule violations by hand.
Although many of these violations did occur, the vast majority
resulted either from the hierarchical phenomena described earlier
or from fundamental scaling issues with certain contact structures
and latch-up prevention rules. In no case was any significant block
relayout required, however.

Scaling-Sensitive Circuits. Although algorithmic translation of
PA 7200 circuits generally improves electrical performance and
decreases parasitic effects, there are a few exceptional circuits
with different characteristics. In general, these were abnormally
sensitive to transistor sizing ratios, noise caused by coupling,
voltage shifts caused by charge sharing, small variations in
processing parameters, or the reduced 4.4V high level.
Additionally, total resistance in the third layer of metal can
increase after translation and cause routing delays to improve less
than the basic scaling assumptions predict.

Summary 

The design goal for the PA 7200 was to increase the performance of
Hewlett-Packard computer systems on real-world applications in a
variety of markets while maintaining a high degree of
price/performance scalability and a low system component count.
General application performance is improved through an increase in
operating frequency, a second integer ALU for enhanced superscalar
execution, and improved store instruction performance. For
applications that operate on large data sets, such as typical
analytic and scientific applications, the hardware prefetching
algorithms and fully associative assist cache implemented in the
PA 7200 provide excellent performance increases. In addition, the
processor includes a high-bandwidth, low-latency multiprocessor bus
interface to support cost-effective, high-performance, one-way to
four-way multiprocessor systems, which are ideal for technical or
commercial platforms, without additional interface chips.
Additionally, the PA 7200 is scalable from desktop workstations to
many-way multiprocessor corporate computing platforms and
supercomputers.

Acknowledgments

The authors would like to thank all the different teams who
contributed to the successful design of the PA 7200 chip. The
design and implementation was primarily done by many individuals
from the Computer Technology Laboratory and the Cupertino Systems
Laboratory in Cupertino, California and several individuals from
the Engineering Systems Laboratory in Fort Collins, Colorado. Many
thanks also to several teams whose work was responsible for many
key design decisions made. This includes the Integrated Circuits
Business Division in Fort Collins and the System Performance
Laboratory and the California Languages Laboratory in Cupertino.

References 

1. M.J. Mahon, et al, "Hewlett-Packard Precision Architecture: The
Processor," Hewlett-Packard Journal, Vol. 37, no. 8, August 1986,
pp. 4-21.

2. R.B. Lee, "Precision Architecture," IEEE Computer, Vol. 22,
no. 1, January 1989, pp. 79-91.

3. P. Knebel, et al, "HP's PA7100LC: A Low-Cost Superscalar
PA-RISC Processor," Compcon Digest of Papers, February 1993,
pp. 441-447.

4. E. DeLano, et al, "A High-Speed Superscalar PA-RISC Processor,"
Compcon Digest of Papers, February 1992, pp. 116-121.

5. M. Forsyth, et al, "CMOS PA-RISC Processor for a New Family of
Workstations," Compcon Digest of Papers, February 1991,
pp. 202-207.

6. D. Tanksalvala, et al, "A 90-MHz CMOS RISC CPU Designed for
Sustained Performance," ISSCC Digest of Technical Papers, February
1990, pp. 52-53.

7. B.D. Boschma, et al, "A 30-MIPS VLSI CPU," ISSCC Digest of
Technical Papers, February 1989, pp. 82-83.

8. J. Yetter, et al, "A 15-MIPS 32b Microprocessor," ISSCC Digest
of Technical Papers, February 1987, pp. 26-27.

9. D. Fotland, et al, "Hardware Design of the First HP Precision
Architecture Computers," Hewlett-Packard Journal, Vol. 38, no. 3,
March 1987, pp. 4-17.

10. G. Kurpanek, et al, "PA 7200: A PA-RISC Processor with
Integrated High-Performance MP Bus Interface," Compcon Digest of
Papers, February 1994, pp. 375-382.

11. E. Rashid, et al, "A CMOS RISC CPU with On-Chip Parallel
Cache," ISSCC Digest of Technical Papers, February 1994.

12. T. Asprey, et al, "Performance Features of the PA 7100
Microprocessor," IEEE Micro, June 1993, pp. 22-35.

13. J. Hennessy and D. Patterson, Computer Architecture, A
Quantitative Approach, Morgan Kaufmann Publishers, 1990.

14. K. Chan, et al, "Multiprocessor Features of the HP Corporate
Business Servers," Compcon Digest of Papers, February 1993, pp.






Verification, Characterization, and 
Debugging of the HP PA 7200
Processor 



To guarantee a high-quality product, the HP PA 7200 CPU chip was
subjected to functional and electrical verification. This article
describes the testing methods, the debugging tools and approaches,
and the impact of the interactions between the chip design and the
IC fabrication process.

by Thomas B. Alexander, Kent A. Dickey, David N. Goldberg, Ross V. La Fetra, James R. McGee,
Nazeem Noordeen, and Akshya Prakash



The complexity of digital VLSI chips has grown dramatically in
recent years. Rapid advances in integrated circuit process
technology have led to ever-increasing densities, which have
enabled designers to design more and more functionality into a
single chip. Electrically, the operating frequency of these VLSI
chips has also gone up significantly. This has been a result of the
increased speed of the transistors (CMOS transistors are commonly
called FETs, for field effect transistors) and the fact that the
circuits are closer to each other than before. All this has had
tremendous benefits in terms of performance, size, and reliability.

The increased complexity of the VLSI chips has created new and more
complicated problems. Many sophisticated techniques and tools are
being developed to deal with this new set of problems. Nowhere is
this better illustrated than with CPUs, especially in design
verification, both functional and electrical. While design has
always been the focus of attention, verification has now become a
very challenging and critical task. In fact, verification
activities now consume more time and resources than design and are
the real limiters of time to market.

On the functional side, for many years now it has been impossible
to come even close to a complete check of all possible states of
the chip. The challenge is to do intelligent verification (both
presilicon and postsilicon) that gives very high confidence that
the design is correct and that the final customer will not see any
problem. On the electrical side, the challenge has been to find the
weak links in the design by creating the right set of environments
and tests that are most likely to expose failures. The increased
complexity of the VLSI chips has also made isolation of a failure
down to the exact FET or signal an increasingly difficult task.

This paper presents the verification methodology, techniques, and
tools that were used on the HP PA 7200 CPU to guarantee a
high-quality product. Fig. 1 shows the PA 7200 CPU in its pin-grid
array package. The paper describes the functional and electrical
verification of the PA 7200 as well as the testing methods, the
debugging tools and approaches, and the impact of the interactions
between the chip design and the IC fabrication process.

Functional Verification 

The PA 7200 CPU underwent intensive design verification efforts to
ensure the quality and correctness of its functionality. These
verification efforts were an integral part of the CPU design
process. Verification was performed at all stages in the design,
and each stage imposed its own constraints on the testing possible.
There were two main stages of functional verification: the
presilicon implementation stage and the postsilicon prototyping
stage.

Presilicon Functional Verification 

Since the design of the PA 7200 was based upon the PA 7100 CPU, we
chose to use the same modeling language and proprietary simulator
to model and verify its design. During the implementation stage a
detailed simulation model was built to verify the correctness of
the design. Early in the implementation stage, software behavioral
models were used to represent portions of the design and these were
incrementally replaced by detailed models. A switch-level model was
also used late in the implementation stage to ensure equivalence
between the actual design implementation and the simulation model.
This switch-level model was extracted from the physical design's
FET artwork netlists and was used in the final regression testing
of the design.

Fig. 1. The PA 7200 CPU in its pin-grid array package, with the lid
removed to reveal the chip.

Test cases were written to provide thorough functional coverage of
the simulation model. The test case strategy for the PA 7200 was
to:

• Run all existing cases derived for previous generations of
PA-RISC processors

• Run all architectural verification programs (AVPs)

• Write and run test suites and cases directed at specific
functional areas of the implementation, including the newly
designed multiprocessor bus interface (Runway bus) and its control
unit, the assist cache, the dual-issue control unit, and other
unique functionality

• Generate and run focused random test cases that thoroughly stress
and vary processor state, cache state, multiprocessor activities,
and timing conditions in the various functional units of the
processor.

Existing legacy test cases and AVPs targeted for other generations
of PA-RISC processors often had to be converted or redirected in a
sensible way to yield interesting cases on the PA 7200. Additional
test cases were generated to create complex interactions between
the CPU functional units, external bus events, and the system
state. An internally developed automated test case generation
program allowed verification engineers to generate thousands of
interesting cases that focused upon and stressed particular CPU
units or functions over a variety of normal, unusual, and boundary
conditions. In addition, many specific cases were generated by hand
to achieve exact timing and logical conditions. Macros were written
and a macro preprocessor was used to facilitate high productivity
in generating test case conditions.

All test code was run on the PA 7200 CPU model and on a PA-RISC
architectural simulator and the results were compared on an
instruction-by-instruction basis. The test case generation and
simulation process is shown in Fig. 2. A PA 7200-specific version
of the PA-RISC architectural simulator was developed to provide
high coverage in the areas of multiprocessor-specific conditions,
ordering rules, cache move-in and move-out rules, and cache
coherence. Some portions of the internal CPU control model were
also compared with the architectural simulator to allow proper
tracking and checking of implementation-specific actions. Since the
PA 7200 was designed to support several processor-to-system-bus
frequency ratios, the simulation environment was built to
facilitate running tests at various ratios.

The architected state of the CPU and simulator, including
architected registers, caches, and TLBs, was initialized at model
startup time. Traces of instruction execution and relevant
architected state from the CPU model and from the PA-RISC simulator
were compared. These traces included disassembled code, affected
register values, and relevant load/store or address information,
providing an effective guide to debugging problems.
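An instruction-by-instruction comparison of this kind can be
sketched in a few lines. The Python below is a hypothetical
illustration, not the proprietary simulator's interface: the
TraceEntry fields and the first_mismatch function are assumed
names, and the PA-RISC-style instructions in the example traces are
made up for the demonstration.

```python
from itertools import zip_longest
from typing import NamedTuple, Optional, Tuple


class TraceEntry(NamedTuple):
    """One retired instruction as seen by a model (assumed fields)."""
    pc: int
    disasm: str
    reg_writes: Tuple[Tuple[str, int], ...]   # ((register, value), ...)
    mem_addr: Optional[int] = None            # load/store address, if any


def first_mismatch(model_trace, reference_trace):
    """Return (index, model_entry, reference_entry) for the first
    instruction at which the traces diverge, or None if they agree.
    A missing entry (one trace shorter) is reported as None."""
    pairs = zip_longest(model_trace, reference_trace)
    for index, (model_entry, ref_entry) in enumerate(pairs):
        if model_entry != ref_entry:
            return index, model_entry, ref_entry
    return None


reference = [
    TraceEntry(0x1000, "ldw 0(r5),r6", (("r6", 7),), mem_addr=0x2000),
    TraceEntry(0x1004, "add r6,r6,r7", (("r7", 14),)),
]
model = [
    TraceEntry(0x1000, "ldw 0(r5),r6", (("r6", 7),), mem_addr=0x2000),
    TraceEntry(0x1004, "add r6,r6,r7", (("r7", 15),)),   # injected error
]

print(first_mismatch(model, reference))
```

Reporting the first divergence, together with the disassembled
instruction and the register values on both sides, points the
engineer at the instruction where the two models first disagree
rather than at a corrupted final state many cycles later.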



[Figure 2: flow diagram with blocks for automatically generated and
hand-generated test cases, architectural state initialization, the
macro processor, assembly code, and the final state compare.]

Fig. 2. PA 7200 test case generation and simulation process.

A test bench approach was used to model other system bus components
and to verify proper system and multiprocessor behavior, including
adherence to the bus protocol. The test bench accepted test case
stimulus to stimulate and check proper CPU operation.
Multiprocessor effects on the caches and the pipeline of the CPU
being tested were checked in detail both by instruction execution
comparison and by final state comparison of architected registers,
caches, and TLBs.

The bulk of the testing during the implementation stage entailed
running assembly language test vectors on the simulation model. The
principal limitation of this stage was the limited execution speed
of the simulation model.

As components of the simulation model became defined and
individually tested, they were combined into increasingly larger
components until a combined simulation model was built for the
entire computer system including processors, memory, and I/O.

An efffjrl was alsfj made to evaluate the test c"ase coverage 
of processt^r rt>ittrol logic to ens me tiial we had liigh cover- 
age of the fuiictjonal miits with normal and conier-crase con- 
ditions. Eturing our regressions of fimctionai simulation, the 
simulation model was instnimented to provide coverage 
data, which w^ls postprocessed to yield coverage metrics on 
the cot Urol logic. 






This verification effort consumed many engineering months in test
bench development, test case generation, and test checking.
Billions of simulation cycles were run on scores of
high-performance technical workstations during nights and weekends
in several different geographical locations. The result of this
effort is a high-quality CPU that booted its operating system and
enabled postsilicon functional and electrical verification efforts
soon after the first silicon parts were received.

This simulation approach also provided a debugging and regression
testing environment for verifying fixes. Specifically, when making
a correction to the CPU, the simulation environment allowed the
verification team to run regression suites that stressed the faulty
area, providing more complete simulation coverage of the problem.

Postsilicon Functional Verification

Despite the massive presilicon testing, there are always bugs that
are found once the first hardware becomes available. Bugs that
affect all chips regardless of temperature, voltage, or frequency
are termed functional bugs. Bugs that are made worse by
environmental conditions or do not occur in all chips are termed
electrical problems, and the test strategy for finding those
problems is detailed in the section on electrical verification.

Testing machines as complex as the HP 9000 J-class and K-class
systems, the first systems to use the PA 7200, was a large effort
involving dozens of people testing specific areas. This section
will describe how the processor design laboratory created
processor-focused tests to find processor bugs. Many other people
contributed to testing other portions of the systems with
intentional overlap of testing coverage to ensure high quality.

Because of the complexity of modern processor chips, not all bugs
are found in presilicon testing. The processor is so complicated
that adequate testing would take years running at operational speed
to hit all the interesting test cases. Presilicon testing is orders
of magnitude too slow to hit that many cases in a reasonable amount
of time. Thus, when presilicon testing stops finding bugs, the chip
is manufactured and postsilicon testing commences. However, finding
bugs is not as simple as just turning the power on and waiting for
the bugs to appear. One problem is deciding what tests to run to
look for failures. Poorly written tests will not find bugs that
customers might find. Another problem is debugging failures to
their root causes in a timely manner. Knowing a problem exists is a
great start, but sometimes discovering exactly what has gone wrong
in such complex systems can be very difficult. Postsilicon testing
loses much of the observability of processor state that was easily
obtained in the simulation environment.

To provide high coverage of design features, three testing tools
were prepared to stress the hardware. These tools were software
programs used to create tests to run on the prototype machines.
Each tool had its own focus and intended overlap with other tools
to improve coverage. All tools had a proven track record from
running on previous systems successfully. To ensure adequate
testing, two tools were heavily modified to stress new features in
the PA 7200.



All of the tools had some features in common. They all ran
standalone on prototype machines under a small operating system.
Because they did not run under the HP-UX operating system, much
better machine control could be achieved. In addition, not needing
the HP-UX operating system decoupled hardware debugging from the
software schedule and let the hardware laboratory find bugs in a
timely manner. (Later in the verification process, HP-UX-based
system testing is performed to ensure thorough coverage. However,
the team did not rely on this to find hardware problems.) All
hardware test tools also had the ability to generate their own code
sequences and were all self-checking. Often these code sequences
were randomly generated, but some tools supported hand-coded tests
to stress a particular hardware feature.

Uniprocessor Testing 

Even though PA 7200 systems support up to four processors,
it is desirable to debug any uniprocessor problems before
testing for the much more complex multiprocessor bugs.
The first tool was leveraged from the PA 7100LC effort to
provide known good coverage of uniprocessor functionality.

This tool operated by generating sequences of pseudorandom
instructions on a known good machine, like an HP 9000
Model 735 workstation. On this reference machine, a
simulator would calculate the correct expected values and then
create a test to be run on the prototype hardware. This test
would set up hundreds of various initial states and run the
prepared sequence. Each time it ran the sequence, the tool
would determine if it got the correct result and display any
differences. Since much of the work was done on another
machine to prepare the correct answer, this tool was very
robust and was a good initial bring-up vehicle. It also could
run its sequences very quickly and give good coverage in a
short amount of time. However, uniprocessor bugs ramped
down very quickly, and so this tool was used much less after
initial bring-up.
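
The reference-machine approach described above can be sketched
in a few lines of Python. This is only an illustration under
stated assumptions: the three-operation instruction set, the
function names, and the run_on_prototype callback are all
hypothetical and are not taken from the actual HP tool.

```python
import random

# Hypothetical three-instruction toy ISA; not the PA-RISC instruction set.
OPS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
    "xor": lambda a, b: a ^ b,
}

def make_sequence(seed, length=20, nregs=4):
    """Generate a pseudorandom sequence of (op, dest, src1, src2) tuples."""
    rng = random.Random(seed)
    names = sorted(OPS)
    return [(rng.choice(names), rng.randrange(nregs),
             rng.randrange(nregs), rng.randrange(nregs))
            for _ in range(length)]

def reference_run(seq, regs):
    """Reference simulator: compute the expected final register state."""
    regs = list(regs)
    for op, dest, src1, src2 in seq:
        regs[dest] = OPS[op](regs[src1], regs[src2])
    return regs

def check(seq, initial_states, run_on_prototype):
    """Run the sequence from each initial state and collect mismatches."""
    failures = []
    for regs in initial_states:
        expected = reference_run(seq, regs)
        actual = run_on_prototype(seq, regs)
        if actual != expected:
            failures.append((regs, expected, actual))
    return failures
```

In this sketch the expected values are computed once per initial
state on the trusted side, so the code under test only has to
produce a final state for comparison.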

Multiprocessor Testing 

The verification team was especially concerned with
multiprocessor bugs, since experience indicated that they are
much more difficult to find and debug than uniprocessor
cases. These complex bugs were often found later in the
project. For this reason, the two other tools used were
heavily modified to enhance PA 7200 testing for
multiprocessor corner cases.

The first multiprocessor-focused tool attempted to do
exhaustive testing of the effects of various bus transactions
interacting with a test sequence. The interference
transactions were fixed but were chosen to hit all the cases that
were considered interesting. The test sequence could be
randomly generated or written manually to stress a
particular area of the processor.

To determine if a test operated properly, the tool would run
the test sequence once without any interference from other
processors. It would capture the machine state after this run
(register, cache, memory) and save it as the reference
results. The tool did not need to know what the test was doing
at all; it simply logged whatever result it got at the end. To
test multiprocessor interference transactions, the tool
would then arrange to have other processors try all
combinations of interesting transactions selected to interact with
the sequence. This was accomplished by running the test in
a loop with the interference transactions being moved
through every possible cycle in a predetermined timing
window. This exhaustive testing of interference transactions
against a code sequence provided known good coverage of
certain troublesome areas. When there were failures, many
useful debugging clues were available regarding which
cases passed and which cases failed to help in tracking
down the bug.
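
The sweep described above (one undisturbed reference run, then
every interference transaction moved through every cycle of a
timing window) can be sketched as follows. The run_test
callback, the transaction labels, and the toy bug are
assumptions made for illustration; the real tool drove other
processors on the bus rather than calling a Python function.

```python
def sweep_interference(run_test, transactions, window):
    """Run once undisturbed, then inject each transaction at each cycle.

    run_test(txn, cycle) returns the captured machine state; calling it
    with (None, None) performs the interference-free reference run.
    """
    reference = run_test(None, None)
    failures = []
    for txn in transactions:
        for cycle in window:
            if run_test(txn, cycle) != reference:
                failures.append((txn, cycle))
    return failures

def toy_machine(txn, cycle):
    """Toy stand-in for the hardware: a hypothetical bug corrupts the
    result only when a 'purge' transaction arrives exactly at cycle 7."""
    result = sum(range(10))
    if txn == "purge" and cycle == 7:
        result += 1
    return result
```

The list of failing (transaction, cycle) pairs is the kind of
pass/fail pattern the text mentions as a debugging clue. As the
article notes next, the scheme cannot detect a bug that also
corrupts the reference run.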

The main deficiency of this tool was that it relied on high
uniprocessor functionality. If there were bugs that could
affect the reference run, the tool would not be able to detect
them. Thus, this tool could not run until uniprocessor
functionality was considered stable. As it turned out, the initial
PA 7200 silicon had very high uniprocessor functionality and
so testing began on initial silicon. One advantage of this tool
over the uniprocessor tool was that it could run for an
unlimited amount of time on the prototype hardware and
generate test cases on its own. This ability made running this
tool much simpler than the uniprocessor tool.

The final tool was the backbone of the functional
verification effort. In many ways, this tool merged the good ideas
from other tools into one program to provide high coverage
with ease of use. This tool generated sequences of
pseudorandom instructions on each processor, ran the sequences
across all the processors simultaneously, and then checked
itself. It calculated the correct results as it created the
instruction sequences, so it could run unattended for an
unlimited time. The sequences and interactions it could
generate were much more stressful than the other tools. Much of
this ability came from the effort put into expanding this tool,
but some of it came from basic design decisions.

This tool relied completely on pseudorandomly generated
code sequences to find its bugs. The tool took probabilities
from an input file which directed the tool to stress certain
areas. These focused tests enhanced coverage of certain
processor functionality such as the new data prefetching
ability of the PA 7200. Almost any parameter that could be
changed was changed constantly by this tool to hit cases
beyond what the verification team could think of. Having
almost no fixed parameters allowed this tool to hit bugs that
no other tool or test has ever hit.
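
To illustrate how an input file of probabilities can steer
pseudorandom generation toward a feature, here is a minimal
sketch. The file format (one "name weight" pair per line), the
instruction-class names, and both function names are
hypothetical; the article does not describe the real tool's
input format.

```python
import random

def parse_weights(text):
    """Parse 'name weight' lines; blank lines and '#' comments are ignored."""
    weights = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            name, value = line.split()
            weights[name] = float(value)
    return weights

def generate(weights, count, seed):
    """Draw `count` instruction classes according to the given weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=count)
```

Raising one weight (for example, a hypothetical "prefetch"
class) biases the generated stream toward that feature without
excluding the other classes, and a fixed seed keeps a failing
sequence reproducible.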

This final tool received additional modifications to test DMA
(direct memory access) between peripheral cards and
memory. The new Runway bus added new bus protocols involving
I/O transactions, which the processor needed to obey to
ensure system correctness. DMA was used to activate these
bus protocols to verify that the PA 7200 operated properly.
To make sure these extra cases were well-covered, DMA
was performed using various peripheral devices while the
processor testing was done. This extra testing was worth
the investment since several bugs were found that might not
have been caught otherwise.

The postsilicon verification effort was considered successful
because the team found almost every bug before other
groups and could communicate workarounds for hardware
problems to keep them from affecting software schedules.



The operating system testing actually found very few
processor bugs, and all serious bugs were found by the
postsilicon hardware verification team. Some of the later hardware
bugs found may never be encountered by the current
operating system because the compilers and the operating system
are limited in the code sequences they emit. However, the
hardware has been verified to the point that if a future
compiler or operating system uses a feature not used before, it
can in all likelihood do so without encountering a bug.

Electrical Verification

Electrical verification of a VLSI device is performed to
guarantee that when the product is shipped to a customer, the
device will function properly over the entire operating region
specified for the product. The operating region variables
include ambient temperature, power supply voltages, and
the clock frequency of the VLSI device. In addition,
electrical verification must account for integrated circuit
fabrication process variation over the life of the device. Testing for
sensitivities to these variables and improving the design to
account for them improves fabrication yield and increases
the margin of the product. This section describes the
various electrical verification activities performed for the PA
7200 CPU chip.

Electrical Characterization

Electrical characterization refers to the task of creating
different test environments and test code with the goal of
identifying electrical failures on the chip. Once an electrical
failure is detected, characterization also includes determining
the characteristics of the failure like dependencies on
voltage, temperature, and frequency.

Electrical failures may manifest themselves on one, several,
or every chip at some operating point (temperature, voltage,
or frequency) of the CPU. Electrical failures cause the chip
to malfunction and typically have a root cause in some
electrical phenomenon such as the familiar hold time or setup
time violation. As chip operating frequencies increase, other
electrical phenomena such as coupling between signals,
charge sharing, and unforeseen interchip circuit interactions
increasingly become issues.

To ensure a high level of quality, various types of testing and
test environments are used to check that all electrical
failures are detected and corrected before shipment to
customers. Dwell testing and shmoo testing are two types of testing
techniques used to characterize chips.

For the PA 7200, dwell testing involved running
pseudorandom code on the system for extended periods of time at
a given voltage-temperature-frequency point. Since the test
code patterns are extremely important for electrical
verification, dwell testing was used to guarantee that the
pseudorandom code would generate sufficient patterns to test the
CPU adequately.

Shmoo testing involves creating voltage-frequency plots
(shmoo plots) by running test code at many
voltage-frequency-temperature combinations. Fig. 3 shows a typical
style of shmoo plot. This plot is for a failing chip that has
some speed problems. By examining the shape of the shmoo
plot, much can be learned about the design of the chip.
Voltage-frequency-temperature points well beyond the legal
operating range should be included in the shmoo plot. It is not
sufficient to rely only on the minimum allowed margin (in
terms of voltage-frequency-temperature) to determine if the
design is robust. The test code run for creating shmoo plots
is extremely important. Simple code can create a false sense
of quality.

Fig. 3. Voltage-frequency shmoo plot. Horizontal axis: Period
(ns), from 20.000 down to 8.000. Vertical axis: voltage, from
5.000 down to 3.800.
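
A shmoo plot is a grid of pass/fail results over two swept
variables. The sketch below builds a text version of such a
grid. The passes callback stands in for running test code on a
real chip, and toy_chip is a made-up linear speed model used
only to give the plot a shape; neither reflects measured PA 7200
behavior.

```python
def shmoo(passes, voltages, periods_ns):
    """Return text rows, highest voltage first; '*' = pass, '.' = fail."""
    rows = []
    for v in sorted(voltages, reverse=True):
        cells = "".join("*" if passes(v, p) else "." for p in periods_ns)
        rows.append(f"{v:5.2f} V  {cells}")
    return rows

def toy_chip(voltage, period_ns):
    """Hypothetical chip: the shortest passing period shrinks as the
    supply voltage rises (made-up linear model, not measured data)."""
    min_period = 16.0 - 1.5 * voltage
    return period_ns >= min_period
```

Calling shmoo(toy_chip, [3.8, 4.4, 5.0], [20, 14, 10, 8]) and
printing the rows gives a small plot in which the failing region
grows as the voltage drops and the period shortens. Sweeping
well past the legal operating range, as the text recommends, is
just a matter of widening the two lists.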

Testing Environments

There were four main testing environments for the PA 7200:
system characterization, chip tester characterization,
production characterization, and functional characterization.

System Characterization. This testing is focused on running
the CPU in the actual system and altering the operating
variables to determine the characteristics of the design. The
variables that are involved here are test code, ambient
temperature, voltages (internal chip voltage, I/O pad voltage,
and cache SRAM voltage), frequency of the chip, types of
CPU chips (variations in manufacturing process), types of
cache SRAMs (slow versus fast), and system bus speed.
Various types of test code are run on the system, including
pseudorandom PA-RISC code, HP-UX application code, and
directed PA-RISC assembly code.

Chip Tester Characterization. This testing consists of running a
set of chips processed with different manufacturing process
variables on a VLSI chip tester over ranges of temperature,
voltage, and frequency, using a set of specific tests written
for the PA 7200. The chip tester can run any piece of code at
operating frequency by providing stimulus and performing
checks at the I/O pins of the chip. Testing is accomplished
through a mixture of parallel and scan methods using a VLSI
test system. The majority of testing is done with at-speed
parallel pin tests. Tests written in PA-RISC assembly code
for the PA 7200 that cover logical functionality and speed
paths are converted through a simulation extraction process
into tester vectors. Scan-based tests are used for circuits
such as standard-cell control blocks and PLA structures,
which are inherently difficult to test fully using parallel pin
tests. These parallel tests are run on the tester well beyond
the operating speed of the chip.

Production Characterization. All PA 7200 chips go through a
set of tests on the chip tester. Since a large number of chips
are manufactured for prototyping purposes, the results of
the normal manufacturing tests are very valuable for
characterization. This testing provides characterization data for all
the chips that are manufactured with a set of specific tests
written for the PA 7200 over ranges of temperature, voltage,
and frequency. Parallel and scan tests written for the PA
7200 are run within the operating range possible on
customer systems as well as in well-defined regions of margin
outside this operating range. This type of testing over all the
chips shows electrical failures that could happen if there are
variations in the manufacturing process over time.

Functional Characterization. This testing involves running
pseudorandomly generated tests on the system at the
nominal operating point for very long periods of time (months).
Even though this testing uses code environments targeted
for functional verification, it can be very effective in
detecting electrical issues. This type of testing can often find any
test cases (circuit paths) that have not been covered in the
prior three types of testing and will reduce the chance that
the customer will ever have any electrical problems.

Debugging 

When a problem is seen within the operating region of the
chip, the problem must be debugged and fixed. Tests are run
well beyond the operating region to look for anomalies.
Failures outside the operating region are also understood to
make sure that the failure will not move into the operating
region (with a different environment, test, or manufacturing
process shift). The root causes of these electrical problems
need to be characterized and understood. In the
characterization of the problem many chips are tried in various
environments to understand the severity of the problem. To
understand the cause of the failure, the test code is analyzed
and converted to a small directed test with only the pertinent
failing sequence. This is necessary to limit the scope of the
investigation. Then the problem is further analyzed on the
chip tester. The chip tester can run any piece of code at
speed but it can run only reasonable sizes of code because
of the amount of tester memory. The tester can perform
types of experiments that the system cannot provide, such
as varying the clock cycle for a certain period of time. This
process is called phase stretching (see Fig. 4). Often the
failing path can be determined at this point based on phase
stretching experiments. Various other techniques can also
be used on the tester to isolate the failing path. Once the
failing path is isolated, the electrical failing mechanism needs
to be understood. Various tools are used to determine the
failing mechanism.

Fig. 4. Timing diagram of a phase stretched clock. The normal
period (T) of the clock is shown in cycles 1, 2, and 4. The normal
phase time is T/2. In the second phase of cycle 3, the phase is
stretched by time Δ for a total phase time of T/2 + Δ.

One method to help identify the failing mechanism is to use
an electron-beam (E-beam) scoping tool on the chip tester.
In this process, the failing test is run in a loop on the tester
and internal signals are probed to look at waveforms and
the relationships between signals. It is very similar to using
an oscilloscope to look at a signal on a printed circuit board
except that it is done at the chip level.

As final confirmation of the failing mechanism, the failing
circuit is modeled by the designer. The electrical
components of the circuit path are extracted and simulated with a
circuit simulator (SPICE). The modeling needs to be
accurate to reproduce the failure on the simulator. Once the
failing mechanism is confirmed in SPICE, a fix is developed and
verified.

Since a chip turn to determine whether the fix will work and
that the fix has no side effects takes a long time, the fix can
often be verified with a focused-ion-beam (FIB) process.
FIBing is a process by which the chip's internal connections
can be modified, thereby changing its behavior or
functionality. In the FIB process, metal wires can be cut, or joined by
metal deposition. FIB is an extremely valuable tool to verify
fixes before implementing them in the actual design.

After the electrical failing mechanism is understood,
additional work is done to create the worst-case test for this
failure. The insight gained from understanding the root cause
allows the test to be tailored to excite the failing mechanism
more readily. This can cause the test to fail more often, at a
lower or higher frequency, a lower or higher voltage, or a
lower or higher temperature. Developing a worst-case test is
an extremely important step. The extent of the original
problem cannot be understood until the worst-case test is
developed. Including the worst-case test in the production screen
ensures that parts shipped to customers will never exhibit
the failure even under varying operating conditions and the
most stressful hardware and software environments.

These points can be illustrated with a case study. The
nominal operating point of the PA 7200 is 120 MHz at a Vdd of
4.4 volts. In this particular example, a failure occurred while
running a pseudorandom test at 5.1 volts and 120 MHz at
high temperature (55°C ambient). Even though the PA 7200
is not required to operate at this voltage, the verification
team did not expect this failure. Thus, this problem needed
to be characterized and its root cause understood.

In this example, this chip was the only one that failed at
5.1 volts. However, a few other chips failed at even higher
voltages. This problem was worse at higher frequencies and
higher temperatures. The test code that was failing was
converted from pseudorandom system code to tester code. Next
the test code was run on the tester and analyzed. Since this
problem did not occur at lower frequencies, each phase of
the clock in the test was stretched to determine which clock
phase made the chip pass or fail. The internal state of the
chip was also dumped out on the tester using serial scan.
The failing and passing internal scanned states were
compared to see which states were improperly set. This helped
to isolate the failing path. Once this was done, the failing
path for this failure was analyzed to understand the
electrical failing mechanism. For this problem, E-beam was used
to understand the failing mechanism.
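
Comparing a passing and a failing scan dump amounts to a
bit-by-bit difference. The following sketch shows the idea,
assuming each dump is a string of '0' and '1' characters of
equal length. The optional names mapping from scan position to
signal name is hypothetical; a real scan chain map would come
from the chip design database.

```python
def diff_scan(passing, failing, names=None):
    """List scan-chain positions whose bits differ between two dumps."""
    if len(passing) != len(failing):
        raise ValueError("scan dumps must have the same length")
    diffs = []
    for pos, (p, f) in enumerate(zip(passing, failing)):
        if p != f:
            label = names.get(pos, f"bit{pos}") if names else f"bit{pos}"
            diffs.append((label, p, f))
    return diffs
```

The positions reported are only a starting point: a state
element that differs may be a downstream victim of the failing
path rather than its origin.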

Fig. 5 shows the circuit that was failing in this debugging
example. The circuit is a latch with the signal LRXH
controlling the transfer of data into the latch. When LRXH and CK1N
(clock) are true (logic 1), the latch is open and the inverted
level of the input RCV gets transferred to the output HM2.
When LRXH is false (logic 0), the latch is closed and the
output HM2 holds its state. Fig. 6 shows the waveforms of the
internal signals that were captured through E-beam. The last
two signals, CK1N and CK2N, are the two-phase clock signals
on the chip. The passing and failing waveforms for LRXH and
HM2 are shown at the top of the figure. The passing
waveforms for LRXH and HM2 are called LRXH/4.7V@lrd+0ns and
HM2/4.7V@lrd+0ns, respectively. The failing waveforms for LRXH
and HM2 are called LRXH/4.7V@lrd-1.0ns and HM2/4.7V@lrd-1.0ns,
respectively. The input signal RCV (not shown in the figure)
is 1 during the first two pulses of LRXH shown, 0 during the
third pulse, and 1 thereafter. The output HM2 is expected to
transition from 0 to 1 during the third LRXH pulse and stay 1
until the fourth pulse. However, the slow falling edge on LRXH
causes a problem. In the failing case, on the third LRXH pulse,
HM2 transitions from 0 to 1 but the slow falling edge on LRXH
also lets the next input value of RCV (1) propagate to the
output HM2. HM2 therefore transitions back to 0. In the
passing case, the falling edge of LRXH arrives a little earlier and
the output HM2 maintains what was captured in the latch (1).
Once the failing mechanism was understood, the worst-case
test was developed. In this case study, the worst-case test
caused many parts to fail at nominal conditions. The failing
mechanism was modeled in a circuit model by the designer.
Once this was done, a fix was developed. FIB was used to
verify the fix. This failure mechanism was fixed by speeding
up LRXH by adding a buffer to the long route of LRXH. Fig. 7
shows how this was done. The figure is a photograph of a
die that was FIBed to buffer the LRXH signal. To do this, the
long vertical metal 3 wire on the right side of the figure was
cut with the FIB process and a buffer was inserted. A buffer
was available on the left side of the figure; however, metal 3
covered this buffer. The FIB process was used to etch the
metal 3 area surrounding the buffer to expose the metal 2
connections of the buffer. The FIB process was then used to
deposit metal to connect the metal 2 of the buffer to the
vertical metal 3 wires. The FIBed chip was then tested to
make sure that the failing mechanism was fixed.

Fig. 5. A latch circuit that was failing at 5.1 volts.
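
The latch failure in this case study can be reproduced in a toy
discrete-time model. The sketch below is an assumption-laden
illustration, not the designer's SPICE model: time is counted
in arbitrary steps, the enable input stands for the combined
LRXH and clock condition, and the function names are invented
here.

```python
def latch_output(samples, initial=0):
    """Inverting transparent latch: while enable is 1 the output follows
    NOT data; while enable is 0 the output holds its last value."""
    out, trace = initial, []
    for enable, data in samples:
        if enable:
            out = 1 - data
        trace.append(out)
    return trace

def third_pulse(enable_fall, data_change, steps=8):
    """Enable is high until step enable_fall; data (RCV in the text)
    changes from 0 to 1 at step data_change."""
    return [(1 if t < enable_fall else 0, 0 if t < data_change else 1)
            for t in range(steps)]
```

With latch_output(third_pulse(4, 5)) the enable falls before the
data changes and the output stays at 1, which corresponds to the
passing case. With latch_output(third_pulse(6, 5)) the late
falling edge leaves the latch open when the data changes, so the
output drops back to 0, which corresponds to the failing case.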

Testability 

We leveraged the test circuits and strategy for the PA 7200
from the PA 7100 chip. The scan controller was required to
change from our proprietary diagnostic instruction port to
the industry-standard JTAG. This was a minimal change,
since both protocols do the same function. The new test
controller was leveraged from the PA 7100LC chip to keep
the design effort to a minimum. Before tape release we
verified that the basic test circuits would work.

Since the test circuits were leveraged from the PA 7100, the
obvious choice was to leverage the test strategy from the
PA 7100 chip as well. A fast parallel pin tester was chosen
early on as the tester for the PA 7200. This tester would
provide both parallel pin testing and scan testing. We decided
that data path circuits would be tested by parallel pin testing,
and scan testing would be limited to a few control blocks.
All speed testing was to be done with parallel pin testing.

Fig. 6. Waveforms of the internal signals of the failing latch
circuit, captured by electron-beam probing.

Benchtop Tester

Since the parallel pin tester was located elsewhere, we
knew we could not use it for local debugging of the chip.
Many problems needed only simple debugging capability
and could be greatly accelerated by the presence of a local
debugging station. For that purpose, we chose an
inexpensive benchtop tester developed internally. This tester applied
all vectors serially to the chip. Vectors developed for serial
use could be used as is. The parallel pin vectors could be
translated into what we called pin vectors, which is a
boundary scan, looking-into-the-chip approach. No speed
testing capability was planned, although some support for
speed testing was present in the PA 7200.

The PA 7200 chip has on-chip clock control. This was
essential to our success because the benchtop tester was not
practically able to provide a separate clock control signal.



Fig. 7. Once the latch problem was found and a fix developed,
the fix was verified by modifying one die using a
focused-ion-beam (FIB) process. The long vertical metal 3 wire
on the right was cut with the FIB process and a buffer was
inserted. A buffer was available on the left side of the figure;
however, metal 3 covered this buffer. The FIB process was used
to etch the metal 3 area surrounding the buffer to expose the
metal 2 connections of the buffer. The FIB process was then
used to deposit metal to connect the metal 2 of the buffer to
the vertical metal 3 wires. The FIBed chip was then tested to
make sure that the failing mechanism was fixed. Callouts in the
photograph: metal 3 wires; metal deposited by FIB process used
to make connection; metal 3 etched by FIB process used for
exposing metal 2; metal 3 cut used to disconnect a signal.
Photo courtesy of FAST (FIB Applied Semiconductor
Technology), San Jose, California.



The tester can (and did) issue clock control commands in
the serial data. Having these commands interpreted on-chip
saved us from having to build that circuitry off-chip. This
made the chip test fixture very simple.

The benchtop tester was the only means of standalone chip
testing we had collocated with the design team, and
therefore was very important to the debugging efforts. The tester
used a workstation as a controller and interface, and was
capable of storing very long vectors (limited only by the
workstation's virtual memory). We had the ability to load the
entire parallel pin vector suite (590 million shifts) into the
benchtop tester at one time, although this took so long as to
be practically prohibitive. The benchtop tester had both
scan and some limited parallel pin capabilities for driving
reset pins.

Benchtop Tester Environment 

The benchtop tester was based on an HP-UX workstation
and could be operated from a script. This allowed us to put
our own script wrappers around the software, which
provided essential control for power supplies and the pulse
generator. These script wrappers also provided transparent
workarounds to some of the limitations of the tester.

We had two testers that we controlled access to via HP
TaskBroker. By using HP TaskBroker, we could easily share the
test fixtures between the various uses, such as test
development, chip debugging, and automatic test verification. For
chip debugging, an engineer could obtain an interactive lock
on the tester (a window would pop up when an engineer got
the tester), and did not have to worry about interference
from an unattended job trying to run. Also, a test could be
initiated from an engineer's desk, and when a tester was
free, the test would run and return the results to the
engineer. HP TaskBroker handled all the queuing and priority
issues.

As our experience increased and our needs became clear,
we wrote more simple scripts around those we already had.
This allowed us to write complex functions as composites of
simple blocks.

Double Step 

As chip bring-up progressed, we found that we could benefit
from some simple local speed test capabilities. As a result,
we chose to implement basic speed testing on the benchtop
tester stations we had in place.

We employed programmable pulse generators and had the
software to control the frequency. All that was needed was
to convert the tests to double-step pin vectors and make
sure they worked. A double-step pin vector is the same as a
single-step pin vector, except that two chip cycles are run at
speed. This requires that the I/O cells be able to store two
values, not just one as would be required for single stepping.
This feature was already in the I/O cell design.

By converting the tests to double-step pin vectors and
making some minor changes to the design, we got double-step
pin vectors working. This capability to do at-speed local
testing was very valuable in debugging the chip.

Additional Tools 

A simple tool was put together to produce shmoo plots of
about 50 points for a single test. We spent considerable
effort optimizing this script. The engineers doing debugging
found this very valuable.






When doing speed path debugging, the engineer wants to
know which cycles are slow. One way to find out is to take a
failing test and make some cycles slower, and if the chip
passes, that means that the chip was failing on one of those
cycles. Just observing the pins is not enough, however, since
a failure may stay inside the chip for a while before
propagating to the pins. We implemented this kind of test by
changing our pin vector strategy from a double step to a
combination of half steps and single steps during selected
cycles of the test. Since the clock commands take a long
time to shift in, this effectively slows down some cycles of
the test. We call this style of testing phase stretching (see
Fig. 4).

Another very valuable tool was an automated phase
stretching tool. It would take a chip and find the slow cycles with a
given set of tests. This would take a few hours, but need not
be supervised, so overnight tests worked well. While this
would not tell what the problem was directly, it provided
valuable clues.
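
The search performed by an automated phase stretching tool can
be sketched as a loop that stretches one cycle at a time and
reruns the failing test. The run_test callback and the toy chip
model below are assumptions for illustration; the article does
not describe how the real tool chose which cycles to stretch.

```python
def find_slow_cycles(run_test, num_cycles):
    """Return the cycles whose stretching alone makes a failing test pass.

    run_test(stretched) takes a frozenset of cycle numbers to stretch and
    returns True if the test passes.
    """
    if run_test(frozenset()):
        return []                      # test already passes at speed
    return [c for c in range(num_cycles) if run_test(frozenset([c]))]

def toy_chip(stretched, slow_cycles=frozenset({12})):
    """Hypothetical chip: passes only if every slow cycle is stretched."""
    return slow_cycles <= stretched
```

One limitation of this simple version is that if more than one
cycle is slow, stretching a single cycle never makes the test
pass, so a practical tool would also need to stretch groups of
cycles.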

We also had the ability to run part of a test, then stop it and
dump the state of the internal scan chains. A chip expert
could look at these dumps and see what went wrong. This
tool was extremely useful during our debugging efforts.

The benchtop testers were considered very valuable to the
debugging of the PA 7200. The software written for the
testers contributed greatly to their success. The benchtop
testers became known for their reliability, ease of use, and
locality.

Design-Process Interactions

To achieve the highest quality in any VLSI product, it is very
important to ensure that there is good harmony in the
relationship between the chip design and the chip fabrication
process. This relationship on the PA 7200 went through
some rocky roads and had its own interesting set of
problems. In the end, however, the desired harmony was achieved
and is reflected in the high quality and yields of the final
product. This section will describe the situation that existed
on the PA 7200 and some of the steps taken to anticipate
and smooth out problems in this area.

The characteristics of the IC process have a big influence on
decisions made at every step of the development cycle of a
VLSI product, starting from the early stages of the design.
The influence can be seen in many areas like the goals of the
design, the feature set to be included, and the details of the
implementation at the transistor level. For example, the
process dictates the intrinsic speed of the transistor, which is a
key factor in setting the frequency goals of the chip.
Similarly, the minimum feature sizes (line width, spacing,
etc.) of the process largely dictate the size of the basic
storage or memory cell. This in turn is a factor that determines,
for example, the size of the TLB on the die, which is a key
component in determining the performance of a
microprocessor. An example of this influence at the
implementation level would be an input pad receiver designed to trip at
a particular voltage level on the external (input) signal. The
implementation has to ensure that the trip level is fairly
tightly controlled at all corners of the process, which is not
easy to do. Another trivial example is the size of a power or
ground trace. The size of the trace required to carry a
certain amount of current is largely dictated by the resistance
and electromigration limits of the metal.

There were two target HP IC processes in mind when the
design of the PA 7200 began: CMOS26 and CMOS14.
CMOS26 was the process of the previous generation CPUs,
the PA 7100 and the PA 7100LC. Its benefits were that it was
a very mature and stable process. Also, some circuits of the
PA 7100 are used in the PA 7200 with little or no
modification, and the behavior of these circuits was well-understood
in this process. CMOS14 was the next-generation process
being developed. Its benefits were, obviously, smaller
feature size and better FET speed. However, only a few simple
chips had been fabricated in this process before the PA 7200,
and many startup problems were likely to be encountered.
That we had a choice influenced the design methodology.
Taking advantage of the scalability of CMOS designs, the
initial design was done in CMOS26. An artwork shrink
process was developed to convert the design to CMOS14. The
shrink process is a topic that merits special attention and is
described in the article on page 25.

As the design went along, it became clear that to meet the
performance and size goals of the product, CMOS14 was the
better choice. To demonstrate feasibility and to iron out
problems with the shrink process, the existing PA 7100 CPU
was taken through the shrink process and fabricated in
CMOS14. Several issues were uncovered, leading to early
detection of potential problems.

Related to the IC fabrication process, the goal of electrical
verification and characterization is to ensure that the VLSI
chip operates correctly for parts fabricated within the
bounds of the normal process variations expected during
manufacturing. An incomplete job done here or variations of
the process outside the normal range can cause subtle
problems that often get detected much later on. There are two
yield calculations that are often used to quantify the
manufacturability of a VLSI product. The functional yield denotes
the fraction of the total die manufactured that are fully
operational (or functional) at some electrical operating point,
that is, some combination of frequency, voltage, and
temperature. The survival yield denotes the fraction of the
functional die that are operational over the entire electrical
operating range of the product, that is, the product specifications
for frequency, voltage, and temperature. (In reality, to
guarantee this, there is some guardbanding that occurs beyond
the operating range of the product.)
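
The two yield definitions can be made concrete with a small calculation. The following Python sketch is only an illustration: the `DieResult` record and the sample lot are hypothetical and do not come from PA 7200 test data.

```python
from dataclasses import dataclass


@dataclass
class DieResult:
    """Test outcome for one die (hypothetical record layout)."""
    works_at_some_point: bool    # passes at some frequency/voltage/temperature point
    works_over_full_range: bool  # passes over the whole specified operating range


def functional_yield(dice):
    """Fraction of all manufactured dice that work at some operating point."""
    return sum(d.works_at_some_point for d in dice) / len(dice)


def survival_yield(dice):
    """Fraction of the functional dice that work over the full operating range."""
    functional = [d for d in dice if d.works_at_some_point]
    return sum(d.works_over_full_range for d in functional) / len(functional)


# Made-up lot of ten dice: six fully good, two marginal, two dead.
lot = ([DieResult(True, True)] * 6
       + [DieResult(True, False)] * 2
       + [DieResult(False, False)] * 2)

print(functional_yield(lot))  # 0.8
print(survival_yield(lot))    # 0.75
```

Note that survival yield is measured against the functional dice only, so a lot can have a poor functional yield and still show a high survival yield.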

To achieve the highest quality and manufacturability of the
final product, the following are some of the objectives set
for electrical characterization:

• Ensure that the design has solid operating margin (in
voltage, frequency, and temperature) for parts fabricated at all
the different corners of the process.
• Ensure consistently high survival yield for a statistically
large number of wafers and lots fabricated.
• To ferret out problems that may be otherwise hard to find,
fabricate some parts at points beyond the normal variations
of the process. Debug problems in these parts to ensure the
robustness of the design.

The PA 7200 chip was the first complex VLSI chip to be
fabricated in CMOS14. That the process was not fully mature at
that point had important implications on the electrical
characterization and debugging effort. Special care had to be
taken to distinguish between the different types of problems:
design problems, process problems, and design-process
interaction problems.

A fundamental design problem is one that shows up on every
lot (a batch of wafers processed together in the fabrication
shop), whatever the process parameters for that lot might
be. For example, a really slow path on the chip may have
some frequency variation from lot to lot, but will show up on
every lot.

Process problems show up on some lots. The most common
symptom is poor functional yield. Sometimes, however, the
symptom can be a subtle electrical failure that is hard to
debug. For example, one problem manifested itself as a race
between the falling edge of a latching signal and new data to
the latch. SPICE simulations showed that the failure could
occur only under abnormally unbalanced loads and clock
skews, which were unrealistic.

A design-process interaction problem shows up to varying
degrees on different lots. It points to a design that is not
very robust and is treated very much like a design problem.
However, typically there tends to be some set of process
conditions that aggravate the problem. Tighter process
controls or retargeting the process slightly can reduce the
impact of such problems temporarily, but the long-term
solution is always to fix the design to make it more robust.
For example, some coupling issues on the PA 7200 occurred
only at one corner of the process. By retargeting the
process to eliminate that corner, the survival yield was
significantly increased.

The shrink process mentioned earlier had given us
tremendous benefits in terms of flexibility and the ability to leverage
existing circuits. However, effort was also spent in
identifying circuits that did not shrink very well. These circuits were
given special care and modified when the decision to use
CMOS14 was made. Overall, the shrink effort was very
successful largely because of the scalability of CMOS designs.
However, the characterization and debugging phase exposed
some interesting new limitations on the scalability of CMOS
designs. When a shrink-related problem circuit was found,
the chip was scanned for other circuits that could have a
similar problem. These circuits were then fixed to prevent
future problems.

Throughout the project, the team always tracked down
problems to their root causes. This approach guaranteed
complete fixes for problems and kept them from ever
showing up again. The result is high-quality bug-free parts and
high yields in manufacturing.

In addition to finding and fixing problems in the design,
there was also a related activity that happened in parallel.
Process parametric data was analyzed in detail for every lot
to look for an optimum region in the process. Detailed
correlation data was produced between process parameters
and chip characteristics like speed, failing voltages, types of
failures, and so on. Many different experiments with process
parameters and masks were also tried, including polysilicon
biases, metal thicknesses, and others. This enabled us to
fine-tune the process to increase the margins, yields, and
quality of the product.

Conclusion 

With the increasing complexity of VLSI chips, specifically
CPUs, design verification has become a critical and
challenging task. This paper has described the methodology and
techniques used to verify the PA 7200 CPU. The approaches
used yielded very good results and led to the efficient
detection and isolation of problems on the chip. This has enabled
Hewlett-Packard to achieve high-quality, volume shipments
in a timely manner.

Acknowledgments

The authors would like to thank all the different teams who
contributed to the successful verification and
characterization of the PA 7200 chip. Special thanks to the many
individuals from the Computer Technology Laboratory and the
Cupertino Systems Laboratory in Cupertino, California who
were involved. Many thanks also to our key partners: the
Engineering Systems Laboratory in Fort Collins, Colorado,
the Integrated Circuits Business Division in Fort Collins and
Corvallis, Oregon, the General Systems Laboratory in
Roseville, California, the Chelmsford Systems Laboratory in
Chelmsford, Massachusetts, the Colorado Computer
Manufacturing Operation in Fort Collins, and the Cupertino
Open Systems Laboratory in Cupertino.

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are X/Open Company UNIX 93
branded products.

UNIX® is a registered trademark in the United States and other countries, licensed exclusively
through X/Open Company Limited.

X/Open® is a registered trademark and the X device is a trademark of X/Open Company
Limited in the UK and other countries.






A New Memory System Design for 
Commercial and Technical Computing 
Products 



This new design is targeted for use in a wide range of HP commercial
servers and technical workstations. It offers improved customer
application performance through improvements in capacity, bandwidth,
latency, performance scalability, reliability, and availability. Two keys to
the improved performance are system-level parallelism and memory
interleaving.

by Thomas R. Hotchkiss, Norman D. Marschke, and Richard M. McClosky



Initially used in HP 9000 K-class midrange commercial
servers and J-class high-end technical workstations, the
J/K-class memory system is a new design targeted for use in a
wide range of HP's commercial and technical computing
products, and is expected to migrate to lower-cost systems
over time. At the inception of the memory design project,
there were two major objectives or themes that needed to
be addressed. First, we focused on providing maximum
value to our customers, and second, we needed to maximize
HP's return on the development investment.

The primary customer value proposition of the J/K-class
memory system is to maximize application performance
over a range of important cost points. After intensive studies
of our existing computing platforms, we determined that
memory capacity, memory bandwidth, memory latency, and
system-level parallelism were key parameters for improving
customer application performance. A major leap in memory
bandwidth was achieved through system-level parallelism
and memory interleaving, which were designed into the
Runway bus and the memory subsystem. A system block
diagram of an HP 9000 K-class server is shown in Fig. 1 on
page 9. The Runway bus (see article, page 18) is the
"information superhighway" that connects the CPUs, memory, and
I/O systems. System-level parallelism and memory
interleaving mean that multiple independent memory accesses can
be issued and processed simultaneously. This means that a
CPU's access to memory is not delayed while an I/O device
is using memory. In a Runway-based system with the
J/K-class memory system, multiple CPUs and I/O devices can all
be accessing memory in parallel. In contrast, many of HP's
earlier computing platforms can process only one memory
transaction at a time.

Another important customer value proposition is investment
protection through performance scalability. Performance
scalability is offered in two dimensions: symmetric
multiprocessing and processor technology upgrades to the
forthcoming PA 8000 CPU. The J/K-class memory system provides the
memory capacity and bandwidth needed for effective
performance scaling in four-way multiprocessing systems.
Initially, Runway-based systems will be offered with the
PA 7200 CPU (see article, page 25), and will be upgradable
to PA 8000 CPU technology with a simple CPU module
exchange. The J/K-class memory system will meet the
demanding performance requirements of the PA 8000 CPU.

Performance is only one part of overall system value.
Another major component of system value is cost. For
example, the use of commodity DRAM technology was imperative
because competitive memory pricing is an absolute
requirement in the cost-sensitive workstation marketplace. The
J/K-class memory system provides lasting performance with
commodity memory pricing and industry-leading
price/performance. Low cost was achieved by using mature IC
processes, commodity DRAM technology, and low-cost chip
packaging. A closely coupled system design approach was
combined with a structured-custom chip design
methodology that allowed the design teams to focus custom design
efforts in the areas that provided the highest performance
gains without driving up system cost. For example, the
system PC boards, DRAM memory modules, and custom chip
I/O circuits were designed and optimized together as a
single highly tuned system to achieve aggressive memory
timing with mature, low-cost IC process and chip packaging
technologies.

Fig. 1. Entry-level memory system. (Block diagram; labels: Runway
bus, master memory controller, memory address bus, slave memory
controller, RAM data bus RD_A, data bus [0:71], bank 0.)

A further customer value important in HP computing
products is reliability and availability. The J/K-class memory
system delivers high reliability and availability with HP
proprietary error detection and correction algorithms. Single-bit
error detection and correction and double-bit error
detection are implemented, of course; these are fairly standard
features in modern, high-performance computer systems.
The J/K-class memory system provides additional
availability by detecting single DRAM part failures for x4 and x8
DRAM topologies, and by detecting addressing errors. The
DRAM part failure detection is particularly important
because single-part failures are more common than double-bit
errors. Extensive circuit simulation and margin and
reliability testing ensure a high-quality electrical design that
minimizes the occurrence of errors.

Finally, greater system reliability is achieved through
complete memory testing at system boot time. Given the large
maximum memory capacity, memory test time was a major
concern. When full memory testing takes a long time,
customers may be inclined to disable complete memory testing
to speed up the boot process. By using custom firmware test
routines that capitalize on system-level parallelism and the
high bandwidth capabilities of the memory system, a full 2G
bytes of memory can be tested in less than five minutes.

Return on Investment

Large-scale design projects like the J/K-class memory
system typically have long development cycles and require
large R&D investments. To maximize the business return on
large projects, designs need to provide lasting value and
cover a wide range of products. Return on investment can
be increased by improving productivity and reducing time to
market. Leveraging and outsourcing should be used as
appropriate to keep HP engineers focused on those portions of
the design that provide maximum business value. The
J/K-class memory system project achieved all of these important
objectives.

A modular architecture was designed so that different
memory subsystem implementations can be constructed
using a set of building blocks with simple,
DRAM-technology-independent interfaces. This flexible architecture allows the
memory system to be used in a wide range of products, each
with different price and performance points. Given the long
development cycles associated with large VLSI design
projects, changing market conditions often require VLSI chips to
be used in products that weren't specified during the design
cycle. The flexible, modular architecture of the J/K-class
memory system increases the design's ability to succeed in
meeting unforeseen market requirements. Simple interfaces
between modules allow components of the design to be
leveraged easily into other memory design projects. Thus, return
on investment is maximized through flexibility and leverage
potential.

* HP Journal memory size convention:

1 kbyte = 1,000 bytes              1K bytes = 1,024 bytes
1 Mbyte = 1,000,000 bytes          1M bytes = 1,048,576 bytes
1 Gbyte = 1,000,000,000 bytes      1G bytes = 1,073,741,824 bytes

Reducing complexity is one of the most powerful techniques
for improving time to market, especially with geographically
diverse design teams and concurrent engineering. A single
critical bug discovered late in the development cycle can
easily add a month or more to the schedule. In the J/K-class
memory project, design complexity was significantly reduced
by focusing on the business value of proposed features. The
basic philosophy employed was to include features that
provide 80% to 90% of the customer benefits for 20% of the effort;
this is the same philosophy that drove the development of
RISC computer architectures. The dedication to reduced
complexity coupled with a strong commitment to design
verification produced excellent results. After the initial
design release, only a single functional bug was fixed in three
unique chips.

Several methods were employed to increase productivity
without sacrificing performance. First, the memory system
architecture uses a "double-wide, half-speed" approach. Most
of the memory system runs at half the frequency of the
high-speed Runway bus, but the data paths are twice as wide so
that full Runway bandwidth is provided. This approach,
coupled with a structured-custom chip design methodology,
made it possible to use highly automated design processes
and a mature IC process. Custom design techniques were
limited to targeted portions of the design that provided the
greatest benefits. Existing low-cost packaging technologies
were used and significant portions of the design were
outsourced to third-party partners. Using all these techniques,
high performance, low cost, and high productivity were
achieved in the J/K-class memory system design project.

Wide Range of Implementations

The memory system design for the J/K-class family of
computers covers a wide range of memory sizes from 32M bytes
in the entry-level workstation (Fig. 1) to 2G bytes in the fully
configured server (Fig. 2). Expandability is achieved with
plug-in dual-inline memory modules that each carry 36 4M-bit
or 16M-bit DRAMs. (Note: Even though the memory modules
are dual-inline and can be called DIMMs, we usually refer to
them as SIMMs, or single-inline memory modules, because
this terminology is so commonly used and familiar.) Because
each DRAM data bus in the memory system is 16 bytes wide,
the 8-byte-wide SIMMs are always installed in pairs. Using
the 4M-bit DRAMs, each pair of SIMMs provides 32M bytes
of memory; and with 16M-bit DRAMs, a pair of SIMMs
provides 128M bytes of memory.
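
The capacity figures follow from the SIMM organization. The Python sketch below reproduces the arithmetic; it assumes that 64 of every 72 stored bits are data and the remaining 8 are ECC, as stated in the SIMM description later in this article. The function names are illustrative only.

```python
MBIT = 1 << 20   # 1M bits, using the journal's "M = 1,048,576" convention
MBYTE = 1 << 20  # 1M bytes


def simm_data_bytes(dram_bits, drams_per_simm=36, data_bits=64, total_bits=72):
    """Usable (non-ECC) bytes on one SIMM built from identical DRAMs."""
    raw_bits = dram_bits * drams_per_simm
    usable_bits = raw_bits * data_bits // total_bits  # strip the ECC share
    return usable_bits // 8


def pair_mbytes(dram_mbit):
    """Usable M bytes provided by one pair of SIMMs."""
    return 2 * simm_data_bytes(dram_mbit * MBIT) // MBYTE


print(pair_mbytes(4))        # 32   (4M-bit DRAMs)
print(pair_mbytes(16))       # 128  (16M-bit DRAMs)
print(16 * pair_mbytes(16))  # 2048 M bytes = 2G bytes with 16 pairs
```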

In an entry-level workstation, the memory can start at 32M
bytes (one pair of SIMMs with 4M-bit DRAMs) and be
expanded up to 512M bytes (using 4 pairs of SIMMs with
16M-bit DRAMs) as shown in Fig. 1. The HP 9000 J-class
workstation can be expanded to 1G bytes of memory using
eight pairs of SIMMs with 16M-bit DRAMs. The HP 9000
Model K400 server can be expanded up to 2G bytes using 16
pairs of SIMMs with 16M-bit DRAMs installed in two memory
carriers.

Fig. 2. High-performance HP 9000 Model K400 memory system.

Design Features* 

The J/K-class memory system design consists of a set of
components or building blocks that can be used to construct
a variety of high-performance memory systems. A primary
goal of the memory system is to enable system designers to
build high-bandwidth, low-latency, low-cost memory systems.
Major features of the design are:

• High performance
• 36-bit real address (32G bytes)
• Support for 4M-bit, 16M-bit, and 64M-bit DRAM technology
• Proven electrical design up to 2G bytes of memory with
16M-bit DRAMs (up to 8G bytes with 64M-bit DRAMs when
available)
• Logical design supports up to 8G bytes with 16M-bit DRAMs
and up to 32G bytes with 64M-bit DRAMs
• Minimum memory increment of 32M bytes with 1Mx4-bit
(4M-bit) DRAMs (2 banks of memory on 2 SIMMs)
• 32-byte cache lines
• Memory interleaving: 4-way per slave memory controller,
electrically proven up to 32-way with logical maximum of
128-way interleaving
• Single-bit error correction, double-bit error detection, and
single-DRAM device failure detection for x4 and x8 parts
• Address error detection
• Error detection, containment, and reporting when corrupt
data is received
• Memory test and initialization in less than 5 minutes for 2G
bytes of memory
• Soft-error memory scrubbing and page deallocation
implemented with software
• 16-byte and 32-byte write access to memory
• IEEE 1149.1 boundary scan in all VLSI parts.






Memory System Description 

A block diagram for a high-performance HP 9000 Model
K400 memory system is shown in Fig. 2. The memory
system has four major components: the master memory
controller (MMC), multiple slave memory controllers (SMC), a
data accumulator/multiplexer (DM), and plug-in memory
modules (SIMMs). The memory system design allows many
possible configurations, and the high-performance Model
K400 system is an example of a large configuration.

The basic unit of memory is called a bank. Each bank of
memory is 16 data bytes wide and can be addressed
independently of all other banks. A 32-byte processor cache line
is read or written in two segments using fast-page-mode
accesses to a bank. Two 16-byte segments are transferred to
or from a bank to make up one 32-byte cache line.

Each slave memory controller (SMC) supports up to four
independent memory banks. Memory performance is highly
dependent on the number of banks, so the SIMMs are
designed so that each SIMM contains eight bytes of two banks.
Since a bank is 16 bytes wide, the minimum memory
increment is two SIMMs, which yields two complete banks. An
additional 16 bits of error correction code (ECC) is included
for each 16 bytes (128 bits) of data. Thus, a memory data
bus carrying 16 bytes of data requires 144 bits total (128 data
bits + 16 ECC bits).

The 16-byte-wide memory data bus, which connects the
master memory controller (MMC) to the data multiplexer
(DM) chip set, operates at 60 MHz for a peak bandwidth of
960 Mbytes/s. Memory banks on the SIMM sets are
connected to the DM chip set via 16-byte RAM data buses (RD_A
and RD_B), which operate at 30 MHz, yielding a peak
bandwidth of 480 Mbytes/s. However, these data buses are
independent, so if memory access operations map to alternate
buses, the peak bandwidth available from RD_A and RD_B
equals that of the memory data bus. The actual bandwidth
will depend on the memory access pattern, which depends
on the system workload.
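
The peak figures are simply bus width times clock rate. As a quick check, the following Python sketch (the helper name is illustrative) reproduces them and confirms that the two half-speed RAM data buses together match the memory data bus.

```python
def peak_mbytes_per_s(width_bytes, clock_mhz):
    """Peak bandwidth in Mbytes/s (1 Mbyte = 1,000,000 bytes), one transfer per clock."""
    return width_bytes * clock_mhz


memory_data_bus = peak_mbytes_per_s(16, 60)  # MMC <-> DM chip set
ram_data_bus = peak_mbytes_per_s(16, 30)     # each of RD_A and RD_B

print(memory_data_bus)   # 960
print(ram_data_bus)      # 480
print(2 * ram_data_bus)  # 960, reached only when accesses alternate between RD_A and RD_B
```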

The set of signals connecting the MMC to the SMC chips and
DM chips is collectively known as the memory system
interconnect (MSI) bus. It is shown in Fig. 2 as the memory
address bus and the MUX data bus.

A 32-byte single cache line transfer requires two cycles of
data on the RAM data and MSI buses. Since the RAM data
bus operates at one-half the frequency of the MSI bus, the
data multiplexer chips are responsible for accumulating and
distributing data between the MUX data bus and the slower
RAM data buses. To reduce the cost of the data MUX chips,
the design of the chip set is bit-sliced into four identical
chips. Each DM chip handles 36 bits of data and is packaged
in a low-cost plastic quad flat pack.
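
Bit slicing can be pictured as splitting each 144-bit bus word into four 36-bit fields, one per DM chip. The Python sketch below assumes contiguous 36-bit slices, with slice 0 holding the least significant bits; the article does not say which bus bits are routed to which chip, so that assignment is purely illustrative.

```python
SLICES = 4       # four identical DM chips
SLICE_BITS = 36  # bits handled by each chip
SLICE_MASK = (1 << SLICE_BITS) - 1


def split_word(word144):
    """Split a 144-bit bus word into four 36-bit slices (slice 0 = low bits)."""
    return [(word144 >> (i * SLICE_BITS)) & SLICE_MASK for i in range(SLICES)]


def join_slices(slices):
    """Reassemble the 144-bit word from its four slices."""
    word = 0
    for i, part in enumerate(slices):
        word |= part << (i * SLICE_BITS)
    return word


word = (1 << 143) | 0xABCDEF  # top bit set plus some low-order data
parts = split_word(word)
print([hex(p) for p in parts])
print(join_slices(parts) == word)  # True
```

Because every slice is handled identically, one chip design can be used four times, which is the cost advantage the text describes.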

Two sets of DM chips are shown in Fig. 2, and four SMC
chips are associated with each DM set. Logically, the MSI
protocol supports up to 32 total SMC chips, up to 32 SMC
chips per DM set, and any number of DM sets. Presently,
memory systems having up to eight SMC chips and two DM
sets have been implemented.

VLSI Chips

Master Memory Controller. Each memory system contains a
single master memory controller. The MMC chip is the core
of the memory system. It communicates with the processors
and the I/O subsystem over the Runway bus, and generates
basic memory accesses which are sent to the slave memory
controllers via the MSI bus.

Slave Memory Controller. A memory system contains one or
more SMC chips, which are responsible for controlling the
DRAM banks based on the accesses generated by the MMC.
The partitioning of functionality between the MMC and SMC
has been carefully designed to allow future support of new
types of DRAMs without modification to the MMC. The
motivation for this is that the MMC is a large complex chip
and would require a large engineering effort to redesign or
modify. The SMC and DM chips are much simpler and can
be redesigned with less effort on a faster schedule. The
memory system design is partitioned so that the MMC does
not contain any logic or functionality that is specific to a
particular type of DRAM. The following logic and
functionality is DRAM-specific and is therefore included in the SMCs:

• DRAM timing
• Refresh control
• Interleaving
• Memory and SIMM configuration registers
• DM control.

Each SMC controls up to four banks of memory. 

Operation of the DRAMs is controlled by multiple slave
memory controllers which receive memory access
commands from the system bus through the master memory
controller. Commands and addresses are received by all
SMCs. A particular SMC responds only if it controls the
requested address and subsequently drives the appropriate
DRAMs with the usual row address strobe (RAS) and column
address strobe (CAS).
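
The broadcast-and-claim behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `SlaveController` class is hypothetical, and each SMC is given one contiguous address range, whereas the real SMCs decode ownership from their interleave and SIMM configuration registers.

```python
class SlaveController:
    """Hypothetical SMC model that owns one contiguous address range."""

    def __init__(self, name, base, size):
        self.name = name
        self.base = base
        self.size = size

    def owns(self, addr):
        return self.base <= addr < self.base + self.size

    def handle(self, addr):
        """Return a description of the DRAM access, or None if not the owner."""
        if not self.owns(addr):
            return None  # every SMC sees the command; non-owners stay silent
        return f"{self.name}: RAS/CAS at offset {addr - self.base:#x}"


def broadcast(smcs, addr):
    """Send one address to all SMCs and collect the responses."""
    responses = []
    for smc in smcs:
        reply = smc.handle(addr)
        if reply is not None:
            responses.append(reply)
    return responses


GRANULE = 1 << 28  # illustrative 256M-byte region per SMC
smcs = [SlaveController(f"SMC{i}", i * GRANULE, GRANULE) for i in range(4)]
print(broadcast(smcs, 0x1234_5678))  # ['SMC1: RAS/CAS at offset 0x2345678']
```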

The slave memory controller chips have configuration
registers to support the following functions:

• Interleaving
• Bank-to-bank switching rates
• Programmable refresh period
• SIMM sizes
• Programmable DRAM timing
• SMC hardware version number (read only)
• SMC status registers for latching MSI parity errors.

Memory refresh is performed by all of the SMCs in a
staggered order so that refresh operations are nonsimultaneous.
Staggered refresh is used to limit the step load demands on
the power supply to minimize the supply noise that would
be caused by simultaneously refreshing all DRAMs (up to
1152 in a Model K400 system). This lowers overall system
cost by reducing the design requirement for the power
supply.
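
One simple way to stagger refresh is to give each SMC an evenly spaced start offset within the refresh period. The Python sketch below illustrates that idea only; the even spacing, the 16,000-ns period, and the function names are assumptions, not details of the actual SMC design.

```python
from collections import Counter


def staggered_offsets(num_smcs, refresh_period_ns):
    """Evenly spaced refresh start times, one per SMC, within one period."""
    return [i * refresh_period_ns // num_smcs for i in range(num_smcs)]


def peak_simultaneous(offsets):
    """Largest number of SMCs that start a refresh at the same instant."""
    return max(Counter(offsets).values())


offsets = staggered_offsets(8, 16_000)  # eight SMCs, illustrative period
print(offsets)                     # [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000]
print(peak_simultaneous(offsets))  # 1
print(peak_simultaneous([0] * 8))  # 8 if every SMC refreshed at once
```

The peak count stands in for the step load seen by the power supply: staggering keeps it at one SMC's worth of DRAMs rather than all of them.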

Data Multiplexer Chip Set. The DM chips are responsible for
accumulating, multiplexing, and demultiplexing between the
16-byte memory data bus and the two independent 16-byte
RAM data buses. They are used only in high-performance
memory systems with more than eight banks of memory.

Dual-Inline Memory Modules

The dual-inline memory modules (called SIMMs) used in this
design are 72 bits wide (64 data bits + 8 ECC bits) organized
into two half-banks as shown in Fig. 3. With 72 bits of shared
data lines and two independent sets of address and control
lines, they hold twice as much memory and supply twice as
many data bits as the common single-inline memory modules
used in personal computers. Two SIMMs are used to form
two fully independent 16-byte banks of memory. Each SIMM
holds 36 4-bit DRAMs: 18 DRAMs for each half-bank. Using
36 1Mx4-bit DRAMs, each SIMM provides 16M bytes of
memory. With 4Mx4-bit DRAMs, each SIMM holds 64M bytes.
To connect all of the data, address, and control signals, a
144-pin dual-inline socket is used (not compatible with the
72-pin SIMMs used in PCs).

The motivation for designing our own memory module
rather than using an industry-standard SIMM was memory
capacity and performance. We would have needed twice as
many PC SIMMs for an equivalent amount of memory. This
would create a physical space problem because twice as
many connectors would be needed, and would create
performance problems because of the increased printed circuit
board trace lengths and unterminated transmission line
stubs.

However, our custom SIMM is expected to become "industry
available." There are no custom VLSI chips included on our
SIMMs, so third-party suppliers can clone the design and
offer memory to HP customers. This is very important
because having multiple suppliers selling memory to our
customers ensures a market-driven price for memory rather
than proprietary pricing. HP customers now demand
competitive memory pricing, so this was a "must" requirement
for the design.

Banks and Interleaving

Each SMC has four independent bank controllers that
perform a double read or write operation to match the 16-byte
width of the RAM data path to the 32-byte size of the CPU
cache line. Thus each memory access operation to a
particular bank is a RAS-CAS-CAS* sequence for reading two
successive memory locations, using the fast-page-mode capability
of the DRAMs. A similar sequence to another bank can
overlap in time so that the -CAS-CAS portion of the second bank
can follow immediately after the RAS-CAS-CAS of the initial
bank. This is interleaving.

Fig. 3. Diagram of the dual-inline memory module (called a SIMM
because its function is similar to the familiar single-inline memory
modules used in PCs). (Labels: address, control, data [0:35],
data [36:71], upper half bank, lower half bank.)

N-way interleaving is implemented, where N is a power of 2.
The total number of banks in an interleave group is not
necessarily a power of 2. When the number of banks is a power
of 2, then the bank select for a given physical address is
determined by the N low-order bits of the physical address. All
banks within an interleave group must be the same size.
Memory banks from different-size SIMMs can be installed in
the same memory subsystem, but they must be included in
different interleave groups.
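
For the power-of-2 case, bank selection reduces to masking low-order address bits. The Python sketch below is a hedged illustration: it assumes that the select bits sit just above the 32-byte cache line offset, so that consecutive cache lines fall in consecutive banks. The article does not spell out the exact bit positions, and `bank_select` is a hypothetical name.

```python
LINE_BYTES = 32  # cache line size from the text


def bank_select(phys_addr, num_banks):
    """Pick a bank for a power-of-2 interleave group (assumed bit positions)."""
    if num_banks < 1 or num_banks & (num_banks - 1):
        raise ValueError("num_banks must be a power of 2")
    line_number = phys_addr // LINE_BYTES  # drop the within-line offset
    return line_number & (num_banks - 1)   # low-order bits choose the bank


# Eight consecutive cache lines spread over a 4-way interleave group.
print([bank_select(addr, 4) for addr in range(0, 8 * LINE_BYTES, LINE_BYTES)])
# [0, 1, 2, 3, 0, 1, 2, 3]
```

Spreading sequential lines across banks this way is what lets the RAS-CAS-CAS sequences of neighboring accesses overlap.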

When the number of banks installed is not a power of 2, the
interleaving algorithm is specially designed to provide a
uniform, nearest-power-of-2 interleaving across the entire
address range. For example, if you install six banks of the same
size, you will get 4-way interleaving across all six banks
rather than 4-way interleaving across 4/6ths of the memory
and 2-way interleaving across 2/6ths of the memory. This
special feature prevents erratic behavior when
non-power-of-2 numbers of banks are installed.

Soft Errors and Memory Scrubbing 

DRAM devices are known to lose data periodically at single-bit
locations because of alpha particle hits. The rate of
occurrence of these soft errors is expected to be one every
1 million hours of operation for each device. A fully configured
2G-byte memory system uses 1152 (=32x36) DRAM devices. Thus, a
soft single-bit error is expected to occur once every 868
hours or ten times per year in normal operation. Single-bit
errors are easily corrected by the MMC when the data is
read from the memory using the ECC bits. Single-bit errors
are corrected on the fly with no performance degradation.
At memory locations that are seldom accessed, the occurrence
of an uncorrectable double-bit error is a real threat to
proper system operation. To mitigate this potential problem,

* RAS = Row address strobe.
CAS = Column address strobe.



memory-read operations are periodically performed on all
memory locations to find and correct any single-bit errors
before they become double-bit errors. This memory scrubbing
software operation occurs in the background with
virtually no impact on system performance.
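
The soft-error estimate above is plain arithmetic and can be checked
with a few lines. This is only a worked example of the figures quoted
in the text; the function name and the 8760-hour year are assumptions
made for the illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def soft_error_interval_hours(devices, hours_per_error_per_device=1e6):
    """Expected hours between soft errors anywhere in the memory system."""
    return hours_per_error_per_device / devices


devices = 32 * 36                       # DRAM devices in a 2G-byte system
interval = soft_error_interval_hours(devices)
errors_per_year = HOURS_PER_YEAR / interval
print(devices, round(interval), round(errors_per_year, 1))
```

With 1152 devices the interval comes out at about 868 hours, or roughly
ten soft errors per year, which is why background scrubbing is worth
doing.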

When a particular DRAM device has a propensity for soft errors
or develops a hard (uncorrectable) error, that area of memory
is deemed unusable. The memory is segmented into 64K-byte
pages. When any location within a particular page is deemed
unusable, that entire page is deallocated from the inventory
of available memory and not used. Should the number of
deallocated pages become excessive, the respective SIMM
modules are deemed faulty and must be replaced at a
convenient service time.
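
The page deallocation bookkeeping can be sketched as follows. The
class and method names are hypothetical and the article does not
describe the actual data structure; the sketch only shows that one bad
location removes its whole 64K-byte page from the inventory.

```python
PAGE_BYTES = 64 * 1024  # memory is segmented into 64K-byte pages


class PageInventory:
    """Hypothetical inventory of usable 64K-byte pages."""

    def __init__(self, memory_bytes):
        self.total_pages = memory_bytes // PAGE_BYTES
        self.deallocated = set()

    def deallocate(self, bad_address):
        """Remove the page containing bad_address; return its page number."""
        page = bad_address // PAGE_BYTES
        if not 0 <= page < self.total_pages:
            raise ValueError("address is outside installed memory")
        self.deallocated.add(page)
        return page

    def available_bytes(self):
        return (self.total_pages - len(self.deallocated)) * PAGE_BYTES


inventory = PageInventory(2 * 1024**3)      # a 2G-byte system
first = inventory.deallocate(0x12345)       # falls in page 1
second = inventory.deallocate(0x1FFFF)      # same page, counted once
print(first, second, len(inventory.deallocated))
```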

Memory Carrier Board 

The memory carrier board (Fig. 4) is designed to function
with HP 9000 K-class and J-class computer systems. The
larger K-class system can use up to two memory carrier
boards, while the smaller J-class system contains only one
board, which is built into the system board. Each memory
carrier board controls from two to sixteen SIMMs. This
allows each memory carrier board to contain from 32M
bytes to 1G bytes of memory.

There are four data multiplexer chips on the memory carrier
board. These multiplex the two 144-bit RAM data buses
(RD_A and RD_B) to the single 144-bit MSI data path to the
MMC chip. They also provide data path timing. Four SMC
components on the memory carrier board provide the MSI
interface control, DRAM control and timing, data MUX
control, refresh timing and control, and bank mapping.

The memory carrier board is designed with maximal
interleaved memory access in mind. Each SMC controls four
SIMM pairs (actually only four banks because there are two
banks on each SIMM pair) and one data bus set (two of four
72-bit parallel RAM data buses). Each data MUX controls
36 bits of each RAM data bus and 36 bits of the MSI bus.
Each SIMM has two address buses (one for each bank) and
one 72-bit RAM data bus. For example, one SMC controls
SIMM pairs 0a/b, 5a/b, 6a/b, and 3a/b, using the RD_A0(0:71)
bus for the SIMMs in the "a" connectors and the RD_A1(0:71)
bus for the SIMMs in the "b" connectors.






[Figure 4: block diagram of the memory carrier board, showing the
SIMM connectors (each SIMM with Bank 0 and Bank 1), the SMC chips,
the data MUX chips, the RAM data buses (for example RD_A0(0:71)),
and the MSI DATA(0:143) bus.]

NOTES:
ctl signals -> DRAM control signals RAS_L, CAS_L, WE_L, OE_L.
A0, A1, A2, A3 -> Row and column addresses, one set for each bank.
Components with dashed outlines are located on bottom of board.

Fig. 4. Memory carrier board architecture. Each SIMM has two address buses and one 72-bit data bus.



Memory Read or Write Operations

The memory carrier board operates on basically one kind of
memory transaction: a 32-byte read or write to a single bank
address. This board will also handle a 16-byte read or write
operation: the timing and addressing are handled just like a
32-byte operation except that the CAS signal for the unused
16-byte half is not asserted.



To perform a memory read or write operation the following
sequence of events occurs. First, an address cycle on the
Runway bus requests a memory transaction from the memory
subsystem. This part of the operation is taken care of by the
MMC. The MMC chip then places this address onto the MSI
bus along with the transaction type code (read or write). All
the SMCs and the MMC latch this address and transaction






code and place this information into their queues. Each SMC
and the MMC chip must keep an identical set of transaction
requests in their queues; this is how data is synchronized for
each memory operation (the SMCs operate in parallel and
independently on separate SIMMs).

Once the address is loaded into their queues, the SMCs
check to see if the address matches one of the banks that
they control. The SMC that has a match will then start an
access to the matching bank if that bank is not already in
use. The other SMCs could also start other memory accesses
in parallel provided there are no conflicts with banks
currently in use.

The memory access starts with driving the row address to
the SIMMs followed by the assertion of RAS. The SMC then
drives the column address to the same SIMMs. This is
followed by the assertion of CAS and output enable or write
enable provided that the data buses are free to be used. At
this time the SMC sends the MREAD signal or the MWRITE
signal to the data MUX chips to tell them which direction the
data will be traveling. The TACK (transaction acknowledged)
signal is toggled by the SMC to tell the MMC chip to supply
data if this is a write operation or receive data if this is a read
operation. When TACK changes state all of the other SMCs
and the MMC step up their queues because this access will
soon be completed.

Once the MMC chip supplies the write data to the data MUXs
or receives the data from the data MUXs on a read
operation, it completes the transaction on the Runway bus. The
memory system can have up to eight memory transactions
in progress at one time and some of them can be overlapped
or paralleled at the same time.

The timing for a single system memory access (idle state) in
a J/K-class system breaks down as follows (measuring from
the beginning of the address cycle on the Runway bus to the
beginning of the first data cycle and measuring in Runway
cycles):

Cycles  Operations during these Cycles

1       Address cycle on Runway bus
2       Address received at MMC to address driven on MSI
        bus (actually 1.5 or 2.5 cycles, each with 50%
        probability)
2       Address on MSI bus
2       SMC address in to RAS driven to DRAMs
6       RAS driven to output enable driven
4       Output enable driven to first data valid at data
        MUX input
4       First data valid at data MUX input to data driven by
        data MUX on MSI bus. With EDO (extended data
        out) DRAMs this time is reduced to 2 cycles.
2       Data on MSI bus
2.5     Data valid at MMC input to data driven on Runway
        bus (includes ECC, synchronization, etc.)
25.5    Total cycles delay from address on Runway bus to
        data on Runway bus for a read operation.
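
The stage counts in the table can be summed mechanically. The sketch
below copies the cycle counts from the table; the list and function
names are made up for the illustration, and the EDO total is derived
here from the stated two-cycle reduction rather than quoted from the
article.

```python
# (cycles, stage) pairs copied from the table above.
READ_STAGES = [
    (1.0, "Address cycle on Runway bus"),
    (2.0, "Address received at MMC to address driven on MSI bus"),
    (2.0, "Address on MSI bus"),
    (2.0, "SMC address in to RAS driven to DRAMs"),
    (6.0, "RAS driven to output enable driven"),
    (4.0, "Output enable driven to first data valid at data MUX input"),
    (4.0, "First data valid at data MUX input to data driven on MSI bus"),
    (2.0, "Data on MSI bus"),
    (2.5, "Data valid at MMC input to data driven on Runway bus"),
]
EDO_STAGE_INDEX = 6       # the data MUX stage that EDO DRAMs shorten
EDO_STAGE_CYCLES = 2.0


def read_latency_cycles(edo=False):
    """Sum the per-stage Runway cycles for an idle-state read."""
    total = 0.0
    for index, (cycles, _stage) in enumerate(READ_STAGES):
        if edo and index == EDO_STAGE_INDEX:
            cycles = EDO_STAGE_CYCLES
        total += cycles
    return total


print(read_latency_cycles(), read_latency_cycles(edo=True))
```

The first value reproduces the 25.5-cycle total in the table; the
second is the same path with the EDO stage shortened to two cycles.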

Register accesses to SMC chips are very similar to memory
accesses except that the register data values are transferred
on the MSI address bus instead of the MSI data bus.



Board Design Challenges 

Early in the board design it became clear that because of the
number of SIMMs and the physical space allocated to the
memory carrier board, the design would not work without
some clever board layout and VLSI pinout changes. After
several different board configurations (physical shape,
SIMM placements, through-hole or surface mount SIMM
connectors, SMC and data MUX placements) were evaluated,
the final configuration of 16 SIMMs, four data multiplexers,
and four SMCs on each of two boards was chosen.

Given the very tight component spacing required with this
configuration, the pinouts of the data MUX and SMC chips
had to be chosen carefully. The pinout of the data MUX chip
was chosen so that the RAM data buses from the SIMMs and
the MSI data bus to the connector were "river routable" (no
trace crossing required). The pinout of the SMC chip was
chosen with the layout of the SIMMs in mind. It also had to
be possible to mount both chips on the backside of the
board and still meet the routing requirements. Being able to
choose chip pinouts to suit board layout requirements is one
of the many advantages of in-house custom chip designs.
Without this ability it is doubtful that this memory carrier
board configuration would be possible. This is another
example of closely coupled system design.

One of the goals for this design was to have a board that
could be customer shippable on first release (no functional
or electrical bugs). To meet this goal a lot of effort was
placed on simulating the operating environment of the
memory subsystem. By doing these simulations, both SPICE and
functional (Verilog, etc.), electrical and functional problems
were found and solved before board and chip designs were
released to be built.

For example, major electrical cross talk problems were
avoided through the use of SPICE simulations. In one case,
four 72-bit buses ran the length of the board (about 10.5
inches) in parallel. Each trace was 0.005 inch wide and the
traces were spaced 0.005 inch apart on the same layer
(standard PCB design rules) with only 0.0048 inch of vertical
separation between layers. Five-wire mutually coupled memory
carrier board and SIMM board models for SPICE were
created using HP VLSI design tools. When this model set
was simulated, the electrical cross talk was shown to be
greater than 60% and would have required a major board
redesign to fix when found after board release. The solution
was to use a 0.007-inch minimum spacing between certain
wires and to use a nonstandard board layer stack
construction that places a ground plane between each pair of signal
layers.

The memory carrier board uses several unusual technologies.
For example, the board construction (see Fig. 5) is
designed to reduce interlayer cross talk between the RD_A and
RD_B data buses. As a result of this board layering, the
characteristic impedance of nominal traces is about 38 ohms for
both the inner and outer signal layers. The nominal trace
width for both inner and outer signal layers is 0.005 inch
with 0.005-inch spacing on the outer layers and 0.007-inch
spacing on the inner layers to reduce coupling between long
parallel data signals.






[Figure 5: cross section of the 12-layer board stack. Conductor
layers, top to bottom:

Side 1   Signal_1
Side 2   Power_1 GND
Side 3   Signal_2
Side 4   Power_2 GND
Side 5   Signal_3
Side 6   Power_3 VDL
Side 7   Power_4 VDD
Side 8   Signal_4
Side 9   Power_5 GND
Side 10  Signal_5
Side 11  Power_6 GND
Side 12  Signal_6

Dielectric layers, top to bottom: prepreg (1 sheet), core, prepreg
(2 sheets), core, prepreg (2 sheets), core, prepreg (2 sheets), core,
prepreg (2 sheets), core, prepreg (1 sheet).

Legend: base laminate 0.0045 ± 0.001 in; prepreg 0.0024 ± 0.001 in.]

Fig. 5. Printed circuit board construction.

The early SPICE and functional simulation effort paid off.
No electrical or functional bugs were found on the memory
carrier board, allowing the R&D revision 1 board design to
be released as manufacturing release revision A.

Another new technology used on the memory carrier board
is the BERG Micropax connector system. The Micropax
connector system was selected because of its controlled
impedance for signal interconnect and the large number of
connections per inch. However, for these very same reasons the
connector system requires extremely tight tolerances when
machining the edge of the board containing the connectors.

A new manufacturing process used by the memory carrier
board, the HP 2RSMT board assembly process, was
developed to allow the surface mounting of extra-fine-pitch VLSI
components on both sides of the board along with the
through-hole SIMM connectors.

Acknowledgments

Development and manufacture of the J/K-class memory system
was carried out at multiple sites within HP. The numerous
contributors are thanked for their dedication and hard
work in the timely completion of this highly successful
memory system. At Chelmsford, Massachusetts: Ruya Atac,
Dave Chou, Shawn Clayton, Bill Clouser, Dan Danecki, Tom
Dickson, Tom Franco, Craig Frink, Joy Har^, Larry Kaylor,
Milton Makris, Dale Montrone, Jack Murphy, Mike Ricchetti,
Bob Ryan, Dan Schumacher, Jim Williams, John Wood, Steve
Zedeck. At Cupertino, California: Tak Watanabe, Ed Jacobs,
Elio Toschi, Sudhir Chandratreya, John Youden, Margaret Att.
In Puerto Rico: Rodolfo Mazo, Roberto Falcon, Enid Dahlia,
Edgardo Aviles, Joel Vientos. At Guadalajara, Mexico:
Adolfo Gascon, Marco Gonzales, Ruben Kleinian, Manuel
Rincon. At Corvallis, Oregon: Pete Fricke, Kathleen Kollitz,
Trung Vu-Vu. At Roseville, California: Dbh Cromwell, Matt
Harline, Hani Hassoun, Andy Levitt.






Hardware Cache Coherent 
Input/Output 

Hardware cache coherent I/O is a new feature of the PA-RISC architecture 
that involves the I/O hardware in ensuring cache coherence, thereby 
reducing CPU and memory overhead and increasing performance. 

by Todd J. Kjos, Helen Nusbaum, Michael K. Traynor, and Brendan A. Voge



A new feature, called hardware cache coherent I/O, was
introduced into the HP PA-RISC architecture as part of the
HP 9000 J/K-class program. This feature allows the I/O
hardware to participate in the system-defined cache coherency
scheme, thereby offloading the memory system and
processors of unnecessary overhead and contributing to greater
system performance. This paper reviews I/O data transfer,
introduces the concept of cache coherent I/O from a
hardware perspective, discusses the implications for HP-UX*
software, illustrates some of the benefits realized by HP's
networking products, and presents measured performance
results.

I/O Data Transfer 

To understand the impact of the HP 9000 J/K-class coherent
I/O implementation, it is necessary to take a step back and
get a high-level view of how data is transferred between I/O
devices and main memory on HP-UX systems.

There are two basic models for data transfer: direct memory
access (DMA) and programmed I/O (PIO). The difference
between the two is that a DMA transfer takes place without
assistance from the host processor while PIO requires the
host processor to move the data by reading and writing
registers on the I/O device. DMA is typically used for devices
like disks and LANs which move large amounts of data and
for which performance is important. PIO is typically used
for low-cost devices for which performance is less
important, like RS-232 ports. PIO is also used for some
high-performance devices like graphics frame buffers if the
programming model requires it.

All data transfers move data either to main memory from an
I/O device (inbound) or from main memory to an I/O device
(outbound). These transfers require one or more
transactions on each bus between the I/O device and main memory.
Fig. 1 shows a typical PA-RISC system with a two-level bus
hierarchy. PA-RISC processor-to-memory buses typically
support transactions in sizes that are powers of 2, up to 32
bytes, that is, READ4, WRITE4, READ8, WRITE8, READ16, WRITE16,
READ32, WRITE32, where the number refers to the number of
bytes in the transaction. Each transaction has a master and
a slave: the master initiates the transaction and the slave
must respond. Write transactions move data from the master
to the slave, and read transactions cause the slave to
respond with data for the master. The processor is always the
master for PIO transactions to the I/O device. An I/O device
is always the master for a DMA transaction. For example, if



a software device driver is reading (PIO) a 32-bit register on
the fast/wide SCSI device shown in Fig. 1, it causes the
processor to master a READ4 transaction to the device, which
results in the I/O adapter mastering a READ4 transaction on
the I/O bus, where the fast/wide SCSI device responds with
the four bytes of data. If the Fibre Channel interface card is
programmed to DMA transfer 4K bytes of data from memory
to the disk, it will master 128 READ32 transactions to get the
data from memory. The bridge forwards transactions in both
directions as appropriate.
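
The transaction count in the Fibre Channel example follows directly
from the transfer and transaction sizes. The helper below is a worked
example only; its name and the rounding-up of partial transactions are
assumptions made for the illustration.

```python
SUPPORTED_SIZES = (4, 8, 16, 32)  # bytes per READn/WRITEn transaction


def transactions_needed(transfer_bytes, transaction_bytes=32):
    """Number of bus transactions needed to move transfer_bytes."""
    if transaction_bytes not in SUPPORTED_SIZES:
        raise ValueError("transaction size must be 4, 8, 16, or 32 bytes")
    if transfer_bytes < 0:
        raise ValueError("transfer size cannot be negative")
    return -(-transfer_bytes // transaction_bytes)  # ceiling division


print(transactions_needed(4 * 1024))      # 4K bytes moved with READ32
print(transactions_needed(4, 4))          # one 32-bit register via READ4
```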

Because PIO transactions are not in memory address space
and are therefore not a coherency concern, the rest of this
article discusses DMA transactions only. The coherent I/O
hardware has no impact at all on I/O software device drivers
that interact with devices via PIO exclusively.

Hardware Implications 

Cache memory is defined as a small, high-speed block of
memory located close to the processor. On the HP PA 7200
and PA 8000 processors, a portion of the software virtual
address (called the virtual index) is used as the cache
lookup. Main memory is much larger and slower than cache.
It is accessed using physical addresses, so a
virtual-to-physical address translation must occur before issuing any
request to memory. Entries in the PA 7200 and PA 8000 caches
are stored in lines of 32 bytes. Since data that is referenced


[Figure 1: block diagram. Legible labels: Processor-Memory Bus
(Runway), I/O Bus (HP-HSC) (two instances), Fast/Wide SCSI.]

Fig. 1. Typical PA-RISC system with a two-level bus hierarchy.






once by the processor is likely to be referenced again and
nearby data is likely to be accessed as well, the line size is
selected to optimize the frequency with which it is accessed
while minimizing the overhead associated with obtaining the
data from main memory. The cache contains the most
recently accessed lines, thereby maximizing the rate at which
processor-to-memory requests are intercepted, ultimately
reducing latency.

When a processor requests data (by doing loads), the line
containing the data is copied from main memory into the
cache. When a processor modifies data (by doing stores),
the copy in the cache will become more up-to-date than the
copy in memory. HP 9000 J/K-class products resolve this
stale data problem by using the snoopy cache coherency
scheme defined in the Runway bus protocol. Each processor
monitors all Runway transactions to determine whether the
virtual index requested matches a line currently stored in its
cache. This is called "snooping the bus." A Runway
processor must own a cache line exclusively or privately before it
can complete a store. Once the store is complete, the cache
line is considered dirty relative to the stale memory copy.
To maximize Runway bus efficiency, processors are not
required to write this dirty data back to memory immediately.
Instead, the write-back operation occurs when the cache
line location is required for use by the owning processor for
another memory access. If, following the store but before
the write-back, another processor issues a read of this cache
line, the owning processor will snoop this read request and
respond with a cache-to-cache copy of the updated cache
line data. This data is then stored in the requesting
processor's cache and main memory.
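
The cache-to-cache copy described above can be illustrated with a toy
model. This is a deliberately simplified sketch, not the Runway
protocol: whole lines are single values, and the `Bus` and `Processor`
classes and their methods are invented for the illustration.

```python
class Bus:
    """Toy shared bus: one memory plus the processors that snoop it."""

    def __init__(self):
        self.memory = {}       # line address -> value (0 if never written)
        self.processors = []

    def read(self, requester, line):
        """Coherent read: a dirty owner supplies the data, not memory."""
        for cpu in self.processors:
            if cpu is not requester and cpu.dirty.get(line):
                data = cpu.cache[line]
                cpu.dirty[line] = False      # owner's copy is now clean
                self.memory[line] = data     # memory receives the copy too
                return data
        return self.memory.get(line, 0)

    def read_private(self, requester, line):
        """Give the requester sole ownership: others write back and drop."""
        for cpu in self.processors:
            if cpu is not requester and line in cpu.cache:
                if cpu.dirty.get(line):
                    self.memory[line] = cpu.cache[line]
                del cpu.cache[line]
                cpu.dirty.pop(line, None)


class Processor:
    def __init__(self, bus):
        self.bus = bus
        self.cache = {}   # line address -> value
        self.dirty = {}   # line address -> True if newer than memory
        bus.processors.append(self)

    def load(self, line):
        if line not in self.cache:
            self.cache[line] = self.bus.read(self, line)
            self.dirty[line] = False
        return self.cache[line]

    def store(self, line, value):
        self.bus.read_private(self, line)   # must own the line first
        self.cache[line] = value
        self.dirty[line] = True             # memory copy is now stale


bus = Bus()
cpu_a, cpu_b = Processor(bus), Processor(bus)
cpu_a.store(0x100, 42)
stale = bus.memory.get(0x100, 0)     # memory has not been updated yet
seen_by_b = cpu_b.load(0x100)        # cpu_a answers the snooped read
print(stale, seen_by_b, bus.memory[0x100])
```

In the model, memory still holds the old value after the store, yet the
second processor reads the new value because the owner supplies it, and
memory is updated as a side effect of that copy.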

Since the I/O system must also read (output) and modify
(input) memory data via DMA transactions, data
consistency for the I/O system must be ensured as well. For
example, if the I/O system is reading data from memory (for
outbound DMA) that is currently dirty in a processor's cache, it
must be prevented from obtaining a stale, out-of-date copy
from memory. Likewise, if the I/O system is writing data to
memory (for inbound DMA), it must ensure that the
processor caches acquire this update. The optimum solution not
only maintains consistency by performing necessary
input/output operations while preventing the transfer of any stale
copies of data, but also minimizes any interference with
CPU cycles, which relate directly to performance.

Cache coherence refers to this consistency of memory
objects between processors, memory modules, and I/O
devices. HP 9000 systems without coherent I/O hardware must
rely on software to maintain cache coherency. At the
hardware level, the I/O device's view of memory is different from
the processor's because requested data might reside in a
processor's cache. Typically, processor caches are virtually
indexed while I/O devices use physical addresses to access
memory. Hence there is no way for I/O devices to participate
in the processor's coherency protocol without additional
hardware support in the I/O adapter.

Some architectures have prevented stale data problems by
implementing physically indexed caches, so that it is the
physical index, not the virtual index, that is snooped on the
bus. Thus, the I/O system is not required to perform a
physical-to-virtual address translation to participate in the snoopy
coherence protocol. On the HP 9000 J/K-class products, we



chose to implement virtually indexed caches, since this
minimizes cache lookup time by eliminating a
virtual-to-physical address translation before the cache access.

Other architectures have avoided the output stale data
problem by implementing write-through caches in which all
processor stores are immediately updated in both the cache and
the memory. The problem with this approach is its high use
of processor-to-memory bus bandwidth. Likewise, to resolve
the input stale data problem, many architectures allow
cache lines to be marked as uncacheable, meaning that they
can never reside in cache, so main memory will always have
correct data. The problem with this approach is that input
data must first be stored into this uncacheable area and then
copied into a cacheable area for the processor to use it. This
process of copying the data again consumes
processor-to-memory cycles for nonuseful work.

Previous implementations of HP's PA-RISC processors
circumvent these problems by making caches visible to
software. On outbound DMA, the software I/O device drivers
execute flush data cache instructions immediately before
output operations. These instructions are broadcast to all
processors and require them to flush their caches by writing
the specified dirty cache lines back to main memory. After
the DMA buffer has been flushed to main memory, the
outbound operation can proceed and is guaranteed to have the
most up-to-date data. On inbound DMA, the software I/O
device drivers execute broadcast purge data cache
instructions just before the input operations to remove the DMA
buffer from all processor caches in the machine. The
PA-RISC architecture's flush data cache and purge data cache
instruction overhead is small compared to the performance
impact incurred by these other schemes, and the I/O
hardware remains ignorant of the complexities associated with
coherency.

I/O Adapter Requirements. The HP 9000 J/K-class products
and the generation of processors they are designed to
support place greater demands on the I/O hardware system,
ultimately requiring the implementation of cache coherent
I/O hardware in the I/O bus adapter, which is the bus
converter between the Runway processor-memory bus and the
HP-HSC I/O bus. The first of these demands for I/O
hardware cache coherence came from the design of the PA 7200
and PA 8000 processors and their respective
implementations of cache prefetching and speculative execution. The
introduction of these features would have required software
I/O device drivers to purge inbound buffers twice, once
before the DMA and once after the DMA completion, thus
doubling the performance penalty. Because aggressive
prefetches into the DMA region could have accessed stale data
after the purge but before the DMA, the second purge would
have been necessary after DMA completion to cleanse stale
data prefetch buffers in the processor. By designing address
translation capabilities into the I/O adapter, we enable it to
participate in the Runway snoopy protocol. By generating
virtual indexes, the I/O adapter enables processors to
compare and detect collisions with current cache addresses and
to purge prefetched data aggressively before it becomes
stale.

Another demand on the I/O adapter pushing it in the
direction of I/O cache coherence came from the HP 9000 J/K-class
memory controller. It was decided that the new memory






controller would not implement writes of less than four
words. (These types of writes would have required
read-modify-write operations in the DRAM array, which have long
cycle times and, if executed frequently, degrade overall main
memory performance.) Because one-word writes occur in
the I/O system, for registers, semaphores, or short DMA
writes, it was necessary that the I/O adapter implement a
one-line-deep cache to buffer cache lines, so that these
one-word writes could be executed by performing a coherent
read private transaction on the Runway bus, obtaining the
most recent copy of the cache line, modifying it locally in
cache, and finally writing the modified line back to main
memory. For the I/O adapter to support a cache on the
Runway bus, it has to have the ability to compare
processor-generated virtual address transactions with the address
contained in its cache to ensure that the processors always
receive the most up-to-date data.
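
The read-modify-write sequence for a one-word write can be sketched in
a few lines. The sketch models memory as a byte array and the coherent
read private transaction as a plain copy of the 32-byte line; the
function name and that simplification are assumptions, and no bus
behavior is modeled.

```python
LINE_BYTES = 32   # cache line size
WORD_BYTES = 4    # one word


def write_word_via_line_cache(memory, address, word):
    """Merge a one-word write into its cache line, then write the line back."""
    if len(word) != WORD_BYTES or address % WORD_BYTES:
        raise ValueError("expected one aligned 4-byte word")
    line_base = address - (address % LINE_BYTES)

    # 1. Obtain the most recent copy of the whole line (stand-in for the
    #    coherent read private transaction).
    line = bytearray(memory[line_base:line_base + LINE_BYTES])

    # 2. Modify the word locally in the one-line cache.
    offset = address - line_base
    line[offset:offset + WORD_BYTES] = word

    # 3. Write the full modified line back to memory.
    memory[line_base:line_base + LINE_BYTES] = line


memory = bytearray(range(64))                      # two cache lines
write_word_via_line_cache(memory, 36, b"\xAA\xBB\xCC\xDD")
print(memory[32:44].hex())
```

Only the four target bytes change; the rest of the line is written
back unmodified, so memory sees a full-line write.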

To simplify the design, the I/O adapter implements a subset
of the Runway bus coherency algorithm. If a processor
requests a cache line currently held privately by the I/O
adapter, the I/O adapter stalls the coherency response,
finishes the read-modify-write sequence, writes the cache line
back to memory, and then responds with COH_OK, meaning
that the I/O adapter does not have a copy of this cache line.
This was much simpler to implement than the processor
algorithm, which on a conflict responds with COH_CPY,
meaning that the processor has this cache line and will issue a
cache-to-cache copy after modifying the cache line. Since
the I/O adapter only has a one-line cache and short DMA
register and semaphore writes are infrequent, it was felt that
the simpler algorithm would not be a performance issue.

A final requirement for the I/O adapter is that it handle 32-bit
I/O virtual addresses on the HP-HSC bus and a larger 40-bit
physical address on the Runway bus to support the new
processors. Thus a mapping function is required to convert
all HP-HSC DMA transactions. This is done via a lookup
table in memory, set up by the operating system, called the
I/O page directory. With minor additional effort, I/O page
directory entries were defined to provide the I/O adapter
with not only the 40-bit physical address, but also the
software virtual index. This provides all the information
necessary for the I/O adapter to be a coherent client on the
Runway bus. The I/O adapter team exploited the mapping
process by implementing an on-chip TLB (translation
lookaside buffer), which caches the most recent translations, to
speed the address conversion, and by storing additional
attribute bits into each page directory entry to provide clues
about the DMA's page destination, thereby allowing further
optimization for each HP-HSC-to-Runway transaction.

I/O TLB Access. The mechanism selected for accessing the I/O
TLB both minimizes the potential for thrashing and is
flexible enough to work with both large and small I/O systems.
(Thrashing occurs when two DMA streams use the same
TLB RAM location, each DMA transaction alternately
casting the other out of the TLB, resulting in tremendous
overhead and lower performance.) Ideally, each DMA stream
should use a different TLB RAM location, so that only one
TLB miss read is done per page of DMA.

We implemented a scheme by which the upper 20 bits of the
I/O virtual address are available to be divided into a chain ID
and a block ID (Fig. 2). The lower 12 bits of the address



[Figure 2: diagram of the translation from the HP-HSC bus address
(I/O adapter internal numbering) to the Runway bus address. Legible
fields: page type, virtual index, physical page number, offset.]

Fig. 2. TLB translation scheme. The upper address bits (chain ID)
of the I/O virtual address are used to access the TLB RAM, and the
remainder of the I/O virtual address (block ID) is used to verify a
TLB hit (as a tag).

must be left alone because of the 4K-byte page size defined
by the architecture. The upper address bits (chain ID) of the
I/O virtual address are used to access the TLB RAM, and the
remainder of the I/O virtual address (block ID) is used to
verify a TLB hit (as a tag). This allows software to assign
each DMA stream to a different chain ID and each 4K-byte
block of the DMA to a different block ID, thus minimizing
thrashing between DMA streams.
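
The chain ID and block ID scheme can be sketched as a small lookup
model. The field widths, the `IoTlb` class, and the dictionary standing
in for the I/O page directory are all assumptions for illustration; the
real entry format and TLB organization are not given in the text.

```python
PAGE_SHIFT = 12          # 4K-byte pages: the low 12 bits are the offset
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def split_io_address(addr, chain_bits, block_bits):
    """Split a 32-bit I/O virtual address into (chain ID, block ID, offset)."""
    if chain_bits + block_bits > 20:
        raise ValueError("chain ID and block ID share the upper 20 bits")
    page = addr >> PAGE_SHIFT
    block_id = page & ((1 << block_bits) - 1)
    chain_id = (page >> block_bits) & ((1 << chain_bits) - 1)
    return chain_id, block_id, addr & PAGE_MASK


class IoTlb:
    """One entry per chain ID; the block ID is kept as the tag."""

    def __init__(self, chain_bits, block_bits, page_directory):
        self.chain_bits = chain_bits
        self.block_bits = block_bits
        self.page_directory = page_directory  # (chain, block) -> physical page
        self.entries = {}                     # chain -> (block tag, physical page)
        self.misses = 0

    def translate(self, addr):
        chain, block, offset = split_io_address(
            addr, self.chain_bits, self.block_bits)
        entry = self.entries.get(chain)
        if entry is None or entry[0] != block:    # tag mismatch: TLB miss
            self.misses += 1
            entry = (block, self.page_directory[(chain, block)])
            self.entries[chain] = entry           # casts out the old entry
        return (entry[1] << PAGE_SHIFT) | offset


def io_address(chain, block, offset, block_bits):
    """Build an I/O virtual address from its three fields."""
    return (((chain << block_bits) | block) << PAGE_SHIFT) | offset


BLOCK_BITS = 12
directory = {(1, 0): 0x12345, (2, 0): 0x54321, (1, 1): 0x00ABC}

# Two streams on different chain IDs: one miss per page, no thrashing.
tlb = IoTlb(8, BLOCK_BITS, directory)
for offset in range(0, 128, 32):
    tlb.translate(io_address(1, 0, offset, BLOCK_BITS))
    tlb.translate(io_address(2, 0, offset, BLOCK_BITS))
separate_misses = tlb.misses

# Two streams sharing chain ID 1: every access evicts the other's entry.
tlb = IoTlb(8, BLOCK_BITS, directory)
for offset in range(0, 128, 32):
    tlb.translate(io_address(1, 0, offset, BLOCK_BITS))
    tlb.translate(io_address(1, 1, offset, BLOCK_BITS))
shared_misses = tlb.misses
print(separate_misses, shared_misses)
```

In this model, streams on separate chain IDs miss once per page, while
streams that share a chain ID miss on every access, which is the
thrashing the scheme is meant to avoid.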

A second feature of this scheme is that it helps limit the
overhead of the I/O page directory. Recall that the I/O page
directory contains all active address translations and must
be memory-resident. I/O page directory size is equal to the
size of one entry times 2^k, where k is the number of chain ID
bits plus the number of block ID bits. The division between
the chain ID and the block ID is programmable, as is the
total number of bits (k), so software can reduce the memory
overhead of the I/O page directory for systems with smaller
I/O subsystems if we guarantee that the leading address bits
are zero for these smaller systems.
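
The size formula can be evaluated directly. The entry size is not
stated in the text, so the 8-byte entry used below is purely a
hypothetical value chosen to make the arithmetic concrete.

```python
def page_directory_bytes(chain_bits, block_bits, entry_bytes):
    """I/O page directory size: one entry times 2**k, k = chain + block bits."""
    k = chain_bits + block_bits
    if not 0 <= k <= 20:
        raise ValueError("k cannot exceed the upper 20 address bits")
    return entry_bytes * (1 << k)


ENTRY_BYTES = 8   # hypothetical entry size, not taken from the article
full = page_directory_bytes(8, 12, ENTRY_BYTES)    # k = 20
small = page_directory_bytes(8, 8, ENTRY_BYTES)    # k = 16
print(full // 1024, "KB vs", small // 1024, "KB")
```

Dropping four bits from k shrinks the directory by a factor of 16,
which is the saving available to systems with smaller I/O subsystems.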

If the translation is not currently loaded in the I/O adapter's
I/O TLB, the I/O adapter reads the translation data from the
I/O page directory and then proceeds with the DMA.
Servicing the TLB miss does not require processor intervention,
although the I/O page directory entry must be valid before
initiating the DMA.

Attribute Bits. Each mapping of a page of memory has
attribute bits (or clues) that allow some control over how the
DMA is performed. The page attribute bits control the
enabling and disabling of prefetch for reads, the enabling and
disabling of atomic mode, and the selection of fast DMA or
safe DMA.

Prefetch enable allows the I/O adapter to prefetch on DMA
reads, thus improving outbound DMA performance. Because






the I/O adapter does not maintain coherency on prefetch
data, software must only enable prefetching on pages where
there will be no conflicts with processor-held cache data.
Prefetch enable has no effect on inbound DMA writes.

Atomic or locked mode allows a DMA transfer to own all of
memory. While an atomic mode transfer is in progress,
processors cannot access main memory. This feature was
added to support PC buses that allow locking (ISA, EISA).
The HP-HSC bus also supports this functionality. In almost
all cases, atomic mode is disabled, because it has
tremendous performance effects on the rest of the system.

The fast/safe bit only has an effect on half-cache-line DMA
writes. However, many I/O devices issue this type of 16-byte
write transaction. In safe mode, the write is done as a
read-modify-write transaction in the I/O adapter cache, which is
relatively low in performance. In fast mode, the write is
issued as a WRITE16_PURGE transaction, which is interpreted
by the processors as a purge cache line transaction and a
write half cache line transaction to memory. The fast/safe
DMA attribute is used in the following way. In the middle of
a long inbound DMA, fast mode is used: the processors'
caches are purged while DMA data is moved into memory.
This is acceptable because the processor should not be
modifying any cache lines since the DMA data would overwrite
the cache data anyway. However, at the beginning or
end of a DMA transfer, where the cache line might be split
between the DMA sequence and some other data unrelated
to the DMA sequence, the DMA transaction needs to preserve
this other data, which might be held private-dirty by a
processor. In such cases, the safe mode is used. This feature
allows the vast majority of 16-byte DMA writes to be done as
WRITE16_PURGEs, which have much better performance than
read-modify-writes internal to the I/O adapter cache. This is
the only half-cache-line transaction the memory subsystem
supports. All other memory transactions operate on full
cache lines.
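The criterion for choosing fast or safe mode can be sketched as a small C predicate. It is a sketch under assumptions, not the adapter's logic: the function name is hypothetical, the 32-byte cache line is inferred from the 16-byte half-cache-line write, and in the scheme described above the attribute is set per page mapping rather than per write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* a half line is 16 bytes in the text */

/* True when the whole cache line containing write_addr lies inside the
 * DMA buffer [buf_start, buf_start + buf_len), so purging that line in
 * the processor caches cannot discard unrelated data (fast mode is
 * acceptable). False means the line is shared with other data and the
 * read-modify-write (safe) path is needed. */
bool dma_write_can_use_fast(uint64_t buf_start, uint64_t buf_len,
                            uint64_t write_addr)
{
    assert(buf_start + buf_len >= buf_start);   /* no address wraparound */
    uint64_t line_start = write_addr & ~(uint64_t)(CACHE_LINE_BYTES - 1u);
    uint64_t line_end = line_start + CACHE_LINE_BYTES;
    return line_start >= buf_start && line_end <= buf_start + buf_len;
}
```

For a buffer that starts 16 bytes into a cache line, only the first 16-byte write fails the test; every write in the interior of the buffer passes, which matches the observation that the vast majority of writes can use fast mode.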

HP-UX Implementation 

Cache coherent I/O affects HP-UX I/O device drivers.
Although the specific algorithm is different for each software
I/O device driver that sets up DMA transactions, the
basic algorithm is the same. For outbound DMA on a system
without coherent I/O hardware, the driver must perform the
following tasks:

• Flush the data caches to make sure memory is consistent
with the processor caches.

• Convert the processor virtual address to a physical address.

• Modify and flush any control structures shared between the
driver and the I/O device.

• Initiate the DMA transfer by programming the device to
move the data from the given physical address, using a
device-specific mechanism.

For inbound DMA the algorithm is similar:

• Purge the data cache to prevent stale cache data from being
written over DMA data.

• Convert the processor virtual address to a physical address.

• Modify and flush any control structures shared between the
driver and the I/O device.

• Initiate the DMA transfer by programming the device to
move the data to the given physical address, using a
device-specific algorithm.



When the DMA completes, the device notifies the host processor
via an interrupt, the driver is invoked to perform any
post-DMA cleanup, and high-level software is notified that
the transfer has completed. For inbound DMA the cleanup
may include purging the data buffer again in case the processor
has prefetched the old data.

To support coherent I/O hardware, changes to this basic
algorithm could not be avoided. Since coherent I/O hardware
translates 32-bit I/O virtual addresses into processor
physical addresses that may be wider than 32 bits, I/O devices
must be programmed for DMA to I/O virtual addresses
instead of physical addresses. Also, since coherency is maintained
by the coherent I/O adapter, no cache flushes or
purges are necessary, and they should be avoided for performance
reasons. To allow drivers to function properly regardless of
whether the system has coherent I/O hardware, HP-UX services
were defined to handle the differences transparently.
There are three main services of interest to drivers: map() is
used to convert a virtual address range into one or more I/O
virtual addresses, unmap() is used to release the resources
obtained via map() once the transfer is complete, and
dma_sync() is used to synchronize the processor caches with
the DMA buffers, replacing explicit cache flush and purge
services. These services are discussed in more detail below.

Drivers had to be modified where one of the following
assumptions existed:

• Devices use processor physical addresses to access memory.
This assumption is still true for noncoherent systems, but
on HP 9000 J/K-class systems I/O virtual addresses must be
used. The map() service transparently returns a physical
address on noncoherent systems and an I/O virtual address
on coherent systems.

• Cache management must be performed by software. This is
still true for noncoherent systems, but on coherent systems
flushes and purges should be avoided for performance reasons.
The dma_sync() service performs the appropriate cache
synchronization functions on noncoherent systems but does
not flush or purge on coherent systems.

• The driver does not have to keep track of any DMA resources.
Drivers now have to remember what I/O virtual
addresses were allocated to them so they can call unmap()
when the DMA transfer is complete.

To accommodate these necessary modifications, the software
model for DMA setup has been changed to:

• Synchronize the caches using dma_sync().

• Convert the processor virtual address to an I/O virtual
address using the map() service.

• Initiate the DMA transfer via a device-specific mechanism.

• Call the unmap() service to release DMA resources when the
DMA transfer is complete.

On noncoherent systems, this has the same effect as before,
except that the driver doesn't know whether or not the
cache was actually flushed or whether the I/O virtual address
is a physical address.
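The transparency described above can be illustrated with a small, self-contained C model. Everything here is hypothetical: the sketch_ names, the platform structure, and the simulated address allocation are stand-ins invented for illustration and are not the HP-UX dma_sync() or map() interfaces. The point is only that the driver-side code path is the same on both kinds of system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform model; not an HP-UX structure. */
struct sketch_platform {
    bool coherent_io;       /* true when a coherent I/O adapter is present */
    unsigned flush_count;   /* counts simulated cache flushes */
    uint32_t next_iova;     /* next simulated I/O virtual address */
};

/* Stand-in for dma_sync(): only noncoherent systems touch the cache. */
void sketch_dma_sync(struct sketch_platform *p)
{
    assert(p != NULL);
    if (!p->coherent_io)
        p->flush_count++;
}

/* Stand-in for map(): returns the address the device is programmed with. */
uint64_t sketch_map(struct sketch_platform *p, uint64_t phys_addr)
{
    assert(p != NULL);
    if (!p->coherent_io)
        return phys_addr;             /* noncoherent: physical address */
    uint32_t iova = p->next_iova;     /* coherent: 32-bit I/O virtual address */
    p->next_iova += 4096u;            /* simulate one 4K-byte page per call */
    return iova;
}

/* Driver-side setup: the same calls regardless of platform. */
uint64_t sketch_dma_setup(struct sketch_platform *p, uint64_t phys_addr)
{
    sketch_dma_sync(p);
    return sketch_map(p, phys_addr);
}
```

A driver written against sketch_dma_setup() cannot tell whether a flush occurred or whether the returned value is a physical address, which is the behavior the services are meant to provide.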

For drivers that rely entirely on existing driver services to
set up DMA buffers (like most EISA drivers), no changes
were needed since the underlying services were modified to
support coherent I/O.






Driver Services: map and unmap. The map() service and its variants
are the only way to obtain an I/O virtual address for a
given memory object. Drivers cannot assume that a buffer
can be mapped in a single call to map() unless the buffer is
aligned on a cache line boundary and does not cross any
page boundaries. Since it is possible that multiple I/O virtual
addresses are needed to map a DMA buffer completely, map()
should be called within a loop as shown in the following C
code fragment:

/* Function to map an outbound DMA buffer
 * Parameters:
 *   isc: driver control structure
 *   space_id: the space id for the buffer
 *   virt_addr: the virtual offset for the buffer
 *   buffer_length: the length (in bytes) of the buffer
 *   iovecs (output): an array of address/length I/O
 *     virtual address pairs
 * Output:
 *   Returns the number of address/length I/O virtual
 *   address pairs (-1 if error)
 */
int my_buffer_mapper(isc, space_id, virt_addr, buffer_length, iovecs)
struct isc_table_type *isc;
struct iovec *iovecs;
{
    struct iovec *first_vec = iovecs; /* start of the output array */
    int vec_cnt = 0;
    struct iovec hostvec;
    int retval;

    /* Flush cache (on noncoherent systems) */
    dma_sync(space_id, virt_addr, buffer_length, IO_SYNC_FORCPU);

    /* Set up input for map() */
    hostvec.iov_base = virt_addr;
    hostvec.iov_len = buffer_length;

    do {
        /* Map the buffer */
        retval = wsio_map(isc, NULL, 0, space_id, virt_addr,
                          &hostvec, iovecs);
        if (retval >= 0) {
            /* Mapping was successful: point to the
             * next I/O virtual address vector. Note:
             * hostvec was modified by map() to point to
             * unmapped portion of the buffer.
             */
            vec_cnt++;
            iovecs++;
        }
    } while (retval > 0);

    /* Check for any errors: release every mapping made so far */
    if (retval < 0) {
        while (vec_cnt > 0) {
            vec_cnt--;
            wsio_unmap(isc, first_vec[vec_cnt].iov_base);
        }
        vec_cnt = -1;
    }
    return (vec_cnt);
}



In this case the map() variant wsio_map() is used to map a
buffer. When a mapping is successful, the driver can expect
that the virtual host range structure has been modified to
point to the unmapped portion of the DMA buffer. The
wsio_map() service just converts the isc parameter into the
appropriate token expected by map() to find the control
structure for the correct I/O page directory. The calling
convention for map() is:

map(token,map_cb,hints,range_type,host_range,io_range),

where token is an opaque value that allows map() to find the
bookkeeping structures for the correct I/O page directory.
The map_cb parameter is an optional parameter that allows
map() to store some state information across invocations. It
is used to optimize the default I/O virtual address allocation
scheme (see below). The hints parameter allows drivers to
specify page attributes to be set in the I/O page directory or
to inhibit special handling of unaligned DMA buffers. The
host_range contains the virtual address and length of the
buffer to be mapped. As a side effect, the host_range is modified
by map() to point to the unmapped portion of the buffer
(or the end of the buffer if the entire range was mapped).
The io_range is set up by map() to indicate the I/O virtual
address and the length of the buffer that was just mapped.
The range_type is usually the space ID for the virtual address,
but may indicate that the buffer is a physical address.

All I/O virtual addresses allocated via map() must be deallocated
via unmap() when they are no longer needed, either
because there was an error or because the DMA completed.
The calling convention for unmap() is:

unmap(token,io_range).

The map() service uses the following algorithm to map memory
objects into I/O virtual space:

• Allocate an I/O virtual address for the mapping.

• Initialize the I/O page directory entry with the appropriate
page attributes. The page directory entry will be brought
into the I/O translation lookaside buffer when there is a
miss.

• Update the caller's range structures and return the number
of bytes left to map.

The first two steps are discussed separately below.

I/O Virtual Address Allocation Policies. As mentioned above, if
several DMA streams share a chain ID, there is a risk that
performance will suffer significantly because of thrashing.
Two allocation schemes that appeared to eliminate thrashing
are:

• Allocate a unique chain ID to every DMA stream.

• Allocate a unique chain ID to every HP-HSC guest.

In the I/O adapter there are a total of 256 I/O translation
lookaside buffer entries, and therefore there are 256 chain
IDs to allocate. The first allocation scheme is unrealistic
because many more than 256 DMA streams can be active
from the operating system's point of view. For instance, a
single networking card can have over 100 buffers available
for inbound data at a given time, so with only three networking
cards the entire I/O TLB would be allocated. Thrashing
isn't really a problem unless individual transactions are
interleaved with the same chain ID, so on the surface it may
appear that the second allocation scheme would do the trick
(since most devices interleave different DMA streams at fairly
coarse granularity like 512 or 1K bytes). Unfortunately, the
second scheme has a problem in that some devices (like SCSI
adapters) can have many large DMA buffers, so all current
outstanding DMA streams cannot be mapped into a single
chain ID. One of the goals of the design was to minimize the
impact on drivers, and many drivers had been designed with
the assumption that there were no resource allocation problems
associated with setting up a DMA buffer. Therefore, it
was unacceptable to fail a mapping request because the
driver's chain ID contained no more free pages. The bookkeeping
involved in managing the fine details of the individual
pages and handling overflow cases while guaranteeing
that mapping requests would not fail caused us to seek a
solution that would minimize (rather than eliminate) the
potential for thrashing, while also minimizing the bookkeeping
and overhead of managing the chain ID resource.

What we finally came up with was two allocation schemes: a
default I/O virtual address allocator, which is well-suited to
mass storage workloads (disk, tape, etc.), and an alternate
allocation scheme for networking-like workloads. It was
observed early on that there are some basic differences in
how some devices behave with regard to DMA buffer management.
Networking drivers tend to have many buffers
allocated for inbound DMA, but devices tend to access them
sequentially. Therefore, networking devices fit the model
very well for the second allocation scheme listed at the
beginning of this section, except that it is likely that multiple
chain IDs will be necessary for each device because of the
number of buffers that must be mapped at a given time.
Mass storage devices, however, may have many DMA buffers
posted to the device, and no assumptions can be made
about the order in which the buffers will be used. This behavior
was dubbed nonsequential. It would have resulted in
excessive overhead in managing the individual pages of a
given chain ID if the second scheme listed at the beginning
of this section had been implemented. To satisfy the requirements
for both sequential and nonsequential devices, it was
decided to manage virtual chain IDs called ranges instead
of chain IDs. This allows the operating system to manage
the resource independent of the physical size of the I/O
translation lookaside buffer. Thrashing is minimized by always
allocating free ranges in order, so that thrashing cannot
occur unless 256 ranges are already allocated. Therefore,
software has a slightly different view of the I/O virtual
address than the hardware, as shown in Fig. 3.

With this definition of an I/O virtual address, software is not
restricted to 256 resources, but instead can configure the
number of resources by adjusting the number of ranges per
chain ID. For the HP-UX 10.0 implementation, there are
eight pages per range so that up to 32K bytes can be mapped
with a single range. The main allocator keeps a bitmap of all
ranges, but does not manage individual pages.

A mass storage driver will have one of these ranges allocated
whenever it maps a 32K-byte or smaller buffer (assuming
the buffer is aligned on a page). For large buffers (> 32K
bytes), several ranges will be allocated. When the DMA
transfer is complete, the driver unmaps the buffer and the
associated ranges are returned to the pool of free resources.
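A range allocator of the kind described can be sketched with a simple bitmap. This is a minimal illustration, not the HP-UX allocator: the function names and the pool size are hypothetical, and "in order" is interpreted here as lowest-numbered free range first. Only the page size and the eight pages per range come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES      4096u
#define PAGES_PER_RANGE 8u                               /* HP-UX 10.0 value */
#define RANGE_BYTES     (PAGE_BYTES * PAGES_PER_RANGE)   /* 32K bytes */
#define NUM_RANGES      1024u                            /* hypothetical pool size */

static uint8_t range_used[NUM_RANGES / 8u];   /* one bit per range, no per-page state */

/* Ranges needed for a page-aligned buffer of len bytes. */
unsigned ranges_needed(uint64_t len)
{
    return (unsigned)((len + RANGE_BYTES - 1u) / RANGE_BYTES);
}

/* Allocate the lowest-numbered free range; returns -1 when none is free. */
int range_alloc(void)
{
    for (unsigned r = 0; r < NUM_RANGES; r++) {
        if (!(range_used[r / 8u] & (1u << (r % 8u)))) {
            range_used[r / 8u] |= (uint8_t)(1u << (r % 8u));
            return (int)r;
        }
    }
    return -1;
}

/* Return a range to the free pool when the buffer is unmapped. */
void range_free(int r)
{
    assert(r >= 0 && (unsigned)r < NUM_RANGES);
    range_used[(unsigned)r / 8u] &= (uint8_t)~(1u << ((unsigned)r % 8u));
}
```

Because only one bit per range is kept, mapping and unmapping a buffer of up to 32K bytes costs a single bitmap update, at the price of reserving a whole range even for a one-page buffer.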

All drivers use the default allocator unless they notify the
system that an alternate allocator is needed. A driver service
called set_attributes allows the driver to specify that it controls
a sequential device and therefore should use a different
allocation scheme. In the sequential allocation scheme used
by networking drivers, the driver is given permanent ownership
of ranges and the individual pages are managed (similar
to the second scheme above). When a networking driver
attempts to map a page and the ranges it owns are all in use,
the services use the default allocation scheme to gain ownership
of another range. Unlike the default scheme, these
ranges are never returned to the free pool. When a driver
unmaps the DMA buffer, the pages are marked as available
resources only for that driver.

When unmap() is called to unmap a DMA buffer, the appropriate
allocation scheme is invoked to release the resource. If
the buffer was allocated via the default allocation scheme
then unmap() purges the I/O TLB entry using the I/O adapter's
direct write command to invalidate the entry. The range is
then marked as available. If the sequential allocation
scheme is used then the I/O TLB is purged using the I/O
adapter's purge TLB command each time unmap() is called.

I/O Page Directory Entry Initialization. Once map() has allocated
the appropriate I/O virtual address as described above, it
will initialize the corresponding entry in the I/O page directory.
Fig. 4 shows the format of an I/O page directory entry.
To fill in the page directory entry, map() needs to know the
physical address, the virtual index in the processor cache,
whether the driver will allow the device to prefetch, and the
page type attributes. The physical address and virtual index
are both obtained from the range_type and host_range parameters
by using the LPA and LCI instructions, respectively. The
LCI (load coherence index) instruction was defined specifically
for coherent I/O services to determine the virtual index
of a memory object. The page type attributes are passed to
map() via the hints parameter. Hints that the driver can specify
are:
• IO_SAFE. Causes the safe page attribute to be set.

• IO_LOCK. Causes the lock (atomic) page attribute to be set.

• IO_NO_SEQ. Causes the prefetch enable page attribute to be
cleared.

Fig. 3. (a) Hardware's (I/O TLB's) view of I/O virtual addresses.
(b) Software's view of I/O virtual addresses. [Figure not
reproduced. Recoverable field labels: 8 bits, log2(NR), log2(NP),
and 12 bits (4K-byte page). NR = number of ranges per chain ID;
NP = number of pages per range (8 in HP-UX 10.0).]

Fig. 4. An I/O page directory entry. [Figure not reproduced.
Recoverable field labels: PPN[0:3], virtual index, physical page
number, R, PH, PA, V. R = reserved field (all bits are set to zero
in reserved fields); PH = prefetch hint bit (enable/disable
prefetch); PA = page attribute bits (disable/enable atomic,
fast/safe DMA); V = valid indicator.]

Refer to "Attribute Bits" above for details of how the
I/O adapter behaves for each of these driver hints. Once the
I/O page directory has been initialized, the buffer can be
used for DMA.

Other hints are:

• IO_IGN_ALIGNMENT. Normally the safe bit will be set automatically
by map() for buffers that are not cache line aligned or
that are smaller than a cache line. This flag forces map() to
ignore cache line alignment.

• IO_CONTIGUOUS. This flag tells map() that the whole buffer
must be mapped in a single call. If map() cannot map the
buffer contiguously then an error is returned (this hint implies
IO_IGN_ALIGNMENT).

Driver Impact. The finite size of the I/O page directory used
for address translations posed some challenges
for drivers.

Drivers must now release DMA resources upon DMA completion.
This requires more state information than drivers
had previously kept. Additionally, some drivers had previously
set up many (thousands) of memory objects, which
I/O adapters need to access. Mapping each of these objects
into the I/O page directory individually could easily consume
thousands of entries. Finally, the default I/O page directory
allocation policies would allocate several entries to a driver
at a time, even if the driver only requires a single translation.

In the case where drivers map hundreds or thousands of
small objects, the solution requires the driver code to be
modified to allocate and map large regions of memory and
then break them down into smaller objects. For example, if a
driver individually allocates and maps 128 32-byte objects,
this would require at least 128 I/O page directory entries.
However, if the driver allocates one page (4096 bytes) of
data, maps the whole page, and then breaks it down into
128 32-byte objects, only one I/O page directory entry is
required.
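The saving in this example can be checked with a few lines of C. The helper names are hypothetical, and the sketch assumes that objects are packed back to back, that the object size divides the page size evenly, and that each mapped page uses exactly one directory entry.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u

/* Entries consumed when every object is mapped on its own
 * (at least one entry per object). */
unsigned entries_if_mapped_individually(unsigned n_objects)
{
    return n_objects;
}

/* Entries consumed when objects are packed into pages that are
 * each mapped once. */
unsigned entries_if_packed(unsigned n_objects, unsigned object_bytes)
{
    assert(object_bytes > 0u && object_bytes <= PAGE_BYTES);
    unsigned per_page = PAGE_BYTES / object_bytes;
    return (n_objects + per_page - 1u) / per_page;
}

/* I/O virtual address of object i inside a page mapped at page_iova. */
uint32_t object_iova(uint32_t page_iova, unsigned object_bytes, unsigned i)
{
    assert(i < PAGE_BYTES / object_bytes);   /* object must lie in this page */
    return page_iova + (uint32_t)(i * object_bytes);
}
```

For 128 objects of 32 bytes, the packed layout needs one entry instead of 128, and the driver derives each object's device-visible address from the single mapping by adding an offset.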

Another solution is to map only objects for transactions that
are soon to be started. Some drivers have statically allocated
and mapped many structures, consuming large numbers
of I/O page directory entries, even though only a few
DMA transactions were active at a time. Dynamically mapping
and unmapping objects on the fly requires extra CPU
bandwidth and driver state information, but can substantially
reduce I/O page directory utilization.

Networking-Specific Applications

The benefits of the selected hardware I/O cache coherence
scheme become evident when examining networking
applications.

High-speed data communication links place increased demands
on system resources, including CPU, bus, and memory,
which must carry and process the data. Processing the
data that these links carry and the applications for which
they will be used requires that resource utilization, measured
both on a per-byte and on a per-packet basis, be
reduced. Additionally, the end-to-end latency (the time it
takes a message sent by an application on one system to be
received by an application on another system) must be
reduced from hundreds of microseconds to tens of
microseconds.

Cache coherent I/O, scatter-gather I/O, and copy-on-write
I/O all offer reduced resource consumption or reduced latency
or both. They do this by reducing data movement and
processor stalls and by simplifying both the hardware and
software necessary to manage complex DMA operations.

Cache coherent I/O reduces the processor, bus, and memory
bandwidth required for each unit of data by eliminating the
need for the processors to manage the cache and by reducing
the number of times data must cross the memory bus.
The processor cycles saved also help to reduce per-packet
latency.

The I/O adapter's address translation facility can be used to
implement scatter-gather I/O for I/O devices that cannot
efficiently manage physically noncontiguous buffers. Previously,
drivers needed to allocate large, physically contiguous
blocks of RAM for such devices. For outbound I/O, the
driver would copy the outbound data into such a buffer. The
mapping facility allows the driver to set up virtually contiguous
mappings for physically scattered, page-sized buffers.
The I/O device's view of the buffer is then contiguous. This
is done by allocating the largest range that the I/O mapping
services allow (32K bytes, currently), then using the remap()
facility to set up a translation for each physical page in a
DMA buffer. Using this facility reduces the processing and
bus bandwidth necessary, and the associated latencies, for
moving noncontiguous data. Requiring only a single address/
length pair, this facility can also be used to reduce the processing
necessary to set up DMAs for, and the latencies imposed
by, existing scatter-gather I/O mechanisms that
require an address/length pair for each physical page.

Cache coherent I/O can be used to achieve true copy-on-write
functionality. Previously, even for copy-on-write data,
the data had to be flushed from the data cache to physical
memory, where the I/O device could access the data. This
flush is essentially a copy. Cache coherent I/O, by allowing
the I/O device to access data directly from the CPU data
caches, eliminates processing time and latency imposed by
this extra copy. The hardware can now support taking data
straight from a user's data buffer to the I/O device.

To take advantage of the optimal page attributes where possible
(e.g., IO_FAST for inbound DMA buffers) while ensuring
correct behavior for devices that require suboptimal memory
accesses such as I/O semaphore or locked (atomic) memory
transactions, the mapping facility can be used to alias multiple
I/O virtual addresses to the same physical addresses.
Some software DMA programming models place DMA control
information immediately adjacent to the DMA buffer.
Frequently, this control information must be accessed by the
I/O device using either read-modify-write or locking behavior.
By mapping the same page twice, once as IO_SAFE and again
as IO_FAST, and providing the I/O device with both I/O virtual
addresses, the device can access memory using optimal
IO_FAST DMA for bulk data transfer and IO_SAFE DMA for
updating the control structures.

Finally, through careful programming, it is possible to take
advantage of IO_FAST coherent I/O, and to allow the driver to
maintain coherence for the occasional noncoherent update.
For example, it is possible for the driver to flush a data
structure explicitly from its cache, which will later be partially
updated through a write-purge transaction from the
adapter. This has the advantage of allowing the adapter to
use its optimum form of DMA, while allowing the driver to
determine when coherency must be explicitly maintained.

Performance Results

To collect performance results, the SPEC† SFS†† (LADDIS)
1.1 benchmark was run on the following configuration:

• 4-way HP 9000 Model K410

• 1G-byte RAM

• 4 FW-SCSI interfaces

• -76 file systems 

• 4 FDDI networks

• HP-UX 10.01 operating system

• 132 NFS daemons

• 8 NFS clients

• 15 load-generating processes per client

The noncoherent system achieved 4255 NFS operations per
second with an average response time of 30.1 ms/operation.
The coherent system achieved 4651 NFS operations per second
with an average response time of 32.4 ms/operation.

The noncoherent system was limited by the response time.
It's likely that with some fine-tuning the noncoherent system
could achieve slightly more throughput.

To compare the machine behavior with and without coherent
I/O, CPU and I/O adapter measurements were taken during
several SFS runs in the configuration described above. The
requested SFS load was 4000 NFS operations per second.
This load level was chosen to load the system without
hitting any known bottlenecks.

Comparing the number of instructions executed per NFS
operation, the coherent system showed a 4% increase over
the noncoherent system, increasing to 40,100 instructions
from 38,500. This increase was because of the overhead of
the mapping calls. If we assume an average of 11 map/unmap
pairs per NFS operation, then each pair costs about 145
instructions more than the alternative broadcast flush/purge
data cache calls.

† SPEC stands for Systems Performance Evaluation Cooperative, an industry-standard benchmarking consortium.
†† The SPEC SFS benchmark measures a system's distributed file system (NFS) performance.

The degradation in path length was offset by a 17% improvement
in CPI (cycles per instruction). CPI was measured at
2.01 on the coherent system and 2.42 on the noncoherent
system.

The overall result was a 13% improvement in CPU instruction
issue cycles per NFS operation. The coherent system
used 80,600 CPU cycles per operation, while the noncoherent
system needed 93,300 cycles.
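The relationship between the quoted path lengths, CPI values, and cycle counts can be reproduced with a short calculation. The function names are hypothetical; the inputs are the instruction counts and CPI figures given in the text, and the comparison rests on the usual identity that cycles per operation equal instructions per operation times cycles per instruction.

```c
#include <assert.h>

/* CPU cycles per NFS operation = instructions per operation * CPI. */
double cycles_per_op(double instr_per_op, double cpi)
{
    return instr_per_op * cpi;
}

/* Signed percentage change going from one value to another. */
double percent_change(double from, double to)
{
    assert(from != 0.0);
    return 100.0 * (to - from) / from;
}
```

Multiplying 40,100 instructions by a CPI of 2.01 gives roughly 80,600 cycles for the coherent system, and 38,500 by 2.42 gives roughly 93,200 for the noncoherent system, a reduction of about 13%. The path length increase from 38,500 to 40,100 is about 4%, and 11 map/unmap pairs at about 145 extra instructions each account for nearly all of the 1,600 added instructions.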

To determine the efficiency of the software algorithms that
manage the I/O TLB and to evaluate the sizing of the TLB,
the number of I/O TLB misses was measured during these
SFS runs. Under an SFS load of 4000 NFS operations per
second, the disk drives missed 1.30 times per NFS operation,
or 0.64% of all accesses.

Acknowledgments

The authors would like to acknowledge Monroe Bridges,
Robert Brooks, Bill Bryg, and Jerry Huck for their contributions
to the cache coherent I/O definition.

Bibliography 

1. W.R. Bryg, J.C. Huck, R.B. Lee, T.C. Miller, and M.J. Mahon,
"Hewlett-Packard Precision Architecture: The Processor," Hewlett-
Packard Journal, Vol. 37, no. 8, August 1986, pp. 4-21.

2. J.L. Hennessy and D.A. Patterson, Computer Architecture, a
Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990.

3. HP-UX Driver Development Guide, Preliminary Draft, HP 9000
Series 700 Computers, Hewlett-Packard Company, January 1995.

HP-UX 9.* and 10.0 for HP 9000 Series 700 and 800 computers are X/Open Company UNIX 93 branded products.

UNIX® is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.

X/Open® is a registered trademark and the X device is a trademark of X/Open Company Limited in the UK and other countries.






A 1.0625-Gbit/s Fibre Channel Chipset 
with Laser Driver 



This chipset implements the Fibre Channel FC-0 physical layer
specification at 1.0625 Gbits/s. The transmitter features 20:1 data
multiplexing with a comma character generator and a clock synthesis
phase-locked loop, and includes a laser driver and a fault monitor for
safety. The receiver provides the functions of clock recovery, 1:20 data
demultiplexing, comma character detection, and word alignment and
includes redundant loss-of-signal alarms for eye safety. A single-chip
version with both transmitter and receiver integrated is designed for disk
drive applications using the Fibre Channel arbitrated loop protocol.

by Justin S. Chang, Richard Dugan, Benny W.H. Lai, and Margaret M. Nakamoto



The information revolution has pushed the datacomm world
to gigabit rates. Hewlett-Packard's G-Link chipset¹ (HDMP-1000)
helped pave the way for low-cost gigabit technology.
Since that debut, important standards such as Fibre Channel
(FC) have incorporated gigabit rates in their documents. HP
now offers a low-cost solution for Fibre Channel applications
with the HDMP-1512 and HDMP-1514 transmitter and
receiver chips, respectively.

The chipset implements the physical layer interface as defined
in Fibre Channel specification FC-0.² Both the transmitter
and the receiver use a "bang-bang" phase-locked loop
technique similar to the G-Link chipset. Since the standard
allows the use of either copper or fibre media, the transmitter
has an integrated CD (compact disk) laser driver in addition
to two 50-ohm cable drivers. Out of concern for eye
safety, the standard requires the chipset to include certain
monitors and controls to interface to an open fibre control
(OFC) chip, so the chipset includes a laser fault monitor and
loss-of-signal alarms.

The chipset's speed is selectable: either 1062.5 Mbits/s or
531.25 Mbits/s. To conserve power, the receiver chip implements
a demultiplexing scheme that allows the use of lower-speed
and lower-power cells to recover the parallel data. A
special selectable "ping-pong" mode for the parallel TTL bus
helps reduce switching noise on the supply lines. The upper
and lower 10 bits are shifted by half a clock cycle relative to
each other when ping-pong mode is active.

Both chips are implemented using a proprietary HP device
array based on the HP25 bipolar process, a 25-GHz fT process.
The array concept not only allowed quick design cycle
times but also enabled the fabrication of a single-chip transceiver
that integrates both the transmitter and the receiver.
The 10-bit transceiver design heavily leveraged the cells of
the chipset. The transceiver is designed for disk drive applications
using the Fibre Channel arbitrated loop (FC-AL)
protocol.



System Configuration

The chipset is designed for use in a Fibre Channel optical
module. Fig. 1a shows a typical system configuration. There
are three ICs: the transmitter, the receiver, and the OFC
(open fiber control) chip. The transmitter and receiver chips
use a system frame clock (53.125 MHz) for transmission and
to assist in meeting the data lock time. The transmitter uses
an external p-n-p power device to handle the potentially large
laser currents, as large as 130 mA. The OFC controller
monitors link status lines from the transmitter and receiver
chips to handle the safety protocol and the link startup
sequence as described in the standard. A photograph of the
assembled module is shown in Fig. 1b.

Transmitter Chip 

The transmitter block diagram is shown in Fig. 2. It consists
of three major blocks (the laser driver, the multiplexer, and
the clock generator and phase-locked loop) plus a host of
I/O and other supporting circuitry.

Laser Driver 

The three distinct laser driver sections are the dc bias cir-
cuit, the ac driver, and the safety circuit, as shown in Fig. 3.
The laser driver is designed for anode bias configurations
and operating rates up to 1.0625 Gbits/s. The dc bias circuit
can handle optical devices that require up to 130 mA of bias
current and have V^|^ as large as 2.3V. For 780-nm CD lasers
operating at ^^-dBm optical power out, the typical dc bias
current is 40 mA and the monitor diode current is 400 µA.
The ac driver provides a minimum modulation of 25 mA
peak-to-peak into the laser.

The dc bias circuit and the ac driver are decoupled for ease
of adjustment. The decoupled scheme allows adjustment of
either the average power or the modulation depth without
affecting the other. Both of these settings are determined by







Fig. 1. (a) Block diagram of the HP HGLM-1063 module.
(b) HP HGLM-1063 module.

resistive elements. The safety circuit monitors fault condi-
tions to ensure that the laser optical output power will not
be at unsafe levels.

DC Bias Circuit. Referring to Fig. 3, the dc bias of the laser is
controlled by the operational amplifier feedback loop. The op
amp's positive input is internally set to 1.85V. It is obtained
through a voltage divider from the 2.42V bandgap reference
node (output of the bandgap reference circuit [3]), LZBGTP. The
negative input, LZMDF, is connected to a bias network con-
trolled by the laser monitor diode. More current to the laser
creates a larger monitor diode current, lowering the voltage
on LZMDF. This results in a higher output voltage on LZDC.
This decreases the Vbe of transistor P1, thereby lowering the
laser bias current until the monitor diode node LZMDF and
the internal reference node are again equal. The monitor
diode has a slow optical response (rise and fall times =
10 ns); thus, it acts as a low-pass filter to improve stability.

The gain through the op amp and the p-n-p transistor affects
the accuracy of the loop in holding to the original setting.
The op amp has a typical gain of 20 dB. Depending on the
external components used, the total voltage loop gain is
nominally 40 dB. This is adequate to hold the bias. Current
gain is supplied by the external p-n-p transistor.
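As a rough illustration of why 40 dB can be adequate, the sketch below converts the quoted gains to linear ratios and estimates the residual error of an idealized feedback loop as 1/(1+A). The 1/(1+A) model and the function names are assumptions made for this example; the article does not describe the loop analysis itself.

```python
def db_to_ratio(gain_db: float) -> float:
    """Convert a voltage gain in dB to a linear ratio (20*log10 convention)."""
    return 10.0 ** (gain_db / 20.0)


def residual_error(loop_gain_db: float) -> float:
    """Fraction of a disturbance left uncorrected by an idealized loop: 1/(1+A)."""
    return 1.0 / (1.0 + db_to_ratio(loop_gain_db))


if __name__ == "__main__":
    # Gains quoted in the text: 20 dB for the op amp, 40 dB nominal total.
    for label, gain_db in (("op amp alone", 20.0), ("total loop", 40.0)):
        print(f"{label}: x{db_to_ratio(gain_db):.0f}, "
              f"residual error {100 * residual_error(gain_db):.2f}%")
```

Under this simplified model, a 40-dB loop (a ratio of 100) leaves roughly 1% of a disturbance uncorrected, compared with about 9% for the op amp alone.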

AC Driver Circuit. The ac driver uses a differential collector-
driven output configuration. The nominal output impedance
is 50 ohms. The drive current is controlled by the external
potentiometer Pot 2. A temperature and supply compensated
constant-current bandgap reference is used to bias the cur-
rent source. The external resistor should have a low temper-
ature coefficient to minimize its effect on the ac drive as the
temperature changes. The supply to the final output stage is
made available to the user for filtering out supply noise.

The equivalent ac impedance for the laser diode is on the
order of 10 ohms. To assist the ac driver in driving current to
the laser and not to the supply path, an 8-ohm resistor and
an RF filter are used to increase the impedance looking into
the dc bias network.

Fig. 4 shows the optical eye pattern from a 780-nm CD laser.
The typical 20-to-80% rise or fall time into a 25-ohm load is
250 ps. The driver can also be adjusted to operate with
1300-nm single-mode lasers.

Laser Monitor and Safety Control. The built-in safety circuit
uses the monitor diode current to check for high optical
output power. The circuit monitors the laser output for devi-
ations larger than ±10% from the nominal power setting. If
the optical power is out of this window, the monitor starts
the laser turn-off process. If the fault state continues for a
time set by the error timing capacitor LZTC, the laser will be
turned off. The circuit can then only be reset by cycling the
laser-on pin, LZON.

The laser safety circuit can be activated in different ways. In 
all cases, the laser is turned off by pulling a large current at 



Fig. 2. HDMP-1512 transmitter chip block diagram.






Fig. 3. Laser transmitter block diagram.

the LZMDF pin, causing the window detector to sense a fault.
This causes the LZDC output to go high and turn off transistor
P1. At the same time, the ac driver is held in a static state.
This is necessary because the ac circuit has enough output
drive to pulse the laser without the dc bias current.




Fig. 4. 780-nm CD laser eye pattern.



The window detector monitors the voltage on LZMDF. The
high and low levels are set at 1.85V ±10%. This translates
directly to monitoring whether the optical output power has
deviated more than ±10% from the nominal setting. If LZMDF
goes out of this range, the charge on the LZTC capacitor will
be discharged by a few hundred microamperes of current. If
the fault continues and the voltage lowers to the fault value
(1.3V), then the error detector cell will output a TTL high
level on the LZF pin and turn off the dc bias. The error time is
set by the capacitance between LZTC and ground. This will be
a few milliseconds for a 0.1-µF capacitor.
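The window check and the fault timer can be sketched numerically as follows. The 1.85V ±10% window and the 1.3V fault value come from the text; the starting voltage on LZTC (5.0V) and the constant 300-µA discharge current are hypothetical values chosen only to illustrate the order of magnitude, since the article gives neither exactly.

```python
NOMINAL_V = 1.85      # op amp reference voltage from the text
TOLERANCE = 0.10      # +/-10% window from the text
FAULT_V = 1.3         # LZTC voltage at which the fault is declared


def window_limits(nominal: float = NOMINAL_V, tol: float = TOLERANCE):
    """Return the (low, high) thresholds of the window detector."""
    return nominal * (1.0 - tol), nominal * (1.0 + tol)


def in_window(v_lzmdf: float) -> bool:
    """True when the LZMDF voltage is inside the allowed window."""
    low, high = window_limits()
    return low <= v_lzmdf <= high


def fault_time_s(c_farads: float, i_amps: float, v_start: float,
                 v_fault: float = FAULT_V) -> float:
    """Time for a constant current to discharge C from v_start to v_fault."""
    return c_farads * (v_start - v_fault) / i_amps


if __name__ == "__main__":
    print(window_limits())
    # Hypothetical: 0.1 uF, 300 uA, capacitor initially charged to 5.0 V.
    print(f"{fault_time_s(0.1e-6, 300e-6, 5.0) * 1e3:.2f} ms")
```

With these assumed values the timer runs for a little over a millisecond, which is consistent in order of magnitude with the text's "a few milliseconds" for a 0.1-µF capacitor.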

There is also a bandgap monitor cell which checks for gross
faults with the 2.42V bandgap. Because this bandgap is used
in setting the window range, a change will not necessarily
cause the window detector to sense a fault. The bandgap
monitor uses the VVj. voltage as a reference to sense when
the bandgap is higher than 3.0V or lower than 2.0V. The 3.0V
translates into a maximum 24% increase in optical output
power before the laser is turned off.

Once a fault has been detected, this condition is latched
until the laser driver is reset using the LZON pin. A TTL high
on LZON will charge the LZTC capacitor while holding the






Fig. 5. Transmitter multiplexer block.



laser output off. When LZON is set low again, all laser cir-
cuitry is enabled.

Transmitter Multiplexer

The block diagram of the transmitter multiplexer is shown
in Fig. 5. In the normal 1062.5-Mbit/s mode, shift registers
are loaded with the 20-bit-wide parallel data and then shifted
to form the high-speed output. To conserve power, an inter-
lacing method is used to allow the shift registers to operate
at half speed. These registers are separated into two banks
of ten and are loaded with the proper bit order. The outputs
of the two banks are then combined with one high-speed D
flip-flop. In the 531.25-Mbit/s mode, only 10 bits are loaded
into a single bank and the second bank is ignored.
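The interlacing idea can be illustrated with a small behavioral sketch. It assumes that even-numbered bits go to one bank and odd-numbered bits to the other, and that bit 0 is transmitted first; the article says only that the banks are "loaded with the proper bit order," so this ordering and the function names are assumptions.

```python
def split_into_banks(word):
    """Split a 20-bit word (list of 0/1) into two banks of ten.

    Assumed ordering: even-indexed bits in bank A, odd-indexed in bank B.
    """
    return word[0::2], word[1::2]


def serialize(word):
    """Interleave the two half-rate banks into one full-rate bit stream."""
    bank_a, bank_b = split_into_banks(word)
    stream = []
    # Each bank shifts once for every two output bits, so it runs at
    # half the serial rate; the final selector alternates between them.
    for bit_a, bit_b in zip(bank_a, bank_b):
        stream.append(bit_a)
        stream.append(bit_b)
    return stream


if __name__ == "__main__":
    word = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0] + [1, 0] * 5
    print(serialize(word) == word)
```

Each bank advances only ten times per 20-bit frame, which is the source of the power saving described above.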

The data byte 1 inputs are inserted with a set of latches,
allowing this data to be shifted by one-half bit relative to






data byte 0. This configuration allows for the ping-pong
mode of operation, in which the two input bytes are time-
shifted by one-half bit to minimize possible switching noise.

An extra feature of the transmitter is "comma" character
generation. When this mode is asserted, the K28.5 character
(0011111010) is loaded into the shift registers. This is partic-
ularly helpful in the evaluation phase of the chipset for byte-
aligning the receiver without the higher-order FC-1 and FC-2
chips.

Transmitter Clock Generator

The logic block diagram of the transmitter clock generator
is shown in Fig. 6. It takes the input from the VCO at 1062.5
MHz and derives the necessary clocks for the multiplexer.
The clock rate reduction involves serial divisions of 2, 2, and
5 for the 1062.5-Mbit/s (20-bit) mode and 2, 5, and 2 for the


Fig. 6. Transmitter clock generator.






Fig. 7. HDMP-1514 receiver chip.



531.25-Mbit/s (10-bit) mode. The divide-by-5 function is last
for the 1062.5-Mbit/s mode, allowing it to operate at the
slower speed to reduce power. All of the clocks are retimed
by the high-speed clock to ensure proper clock alignment.
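The two divider orderings can be checked with a few lines of arithmetic. The sketch below simply walks the divisor sequences given in the text from the 1062.5-MHz VCO frequency; the dictionary and function names are illustrative only.

```python
VCO_MHZ = 1062.5

# Divisor sequences quoted in the text for each speed mode.
DIVISORS = {
    "1062.5 Mbit/s (20-bit)": (2, 2, 5),
    "531.25 Mbit/s (10-bit)": (2, 5, 2),
}


def clock_chain(vco_mhz: float, divisors):
    """Return the clock frequency (MHz) after each successive divider."""
    frequencies = []
    freq = vco_mhz
    for divisor in divisors:
        freq /= divisor
        frequencies.append(freq)
    return frequencies


if __name__ == "__main__":
    for mode, divisors in DIVISORS.items():
        print(mode, clock_chain(VCO_MHZ, divisors))
```

Both orderings end at the 53.125-MHz frame clock. In the 20-bit mode the divide-by-5 stage sees a 265.625-MHz input, whereas in the 10-bit mode it sees 531.25 MHz, which matches the text's remark that placing the divide-by-5 last lets it run at the slower speed.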

Transmitter Phase-Locked Loop

The phase-locked loop is a bang-bang type and is able to
lock onto the reference clock at 53.125 MHz. It consists of a
modified sequential detector, a charge pump integrator, a
VCO, and the clock generator. The detector, integrator, and
VCO were leveraged from the G-Link chipset [1]. The nominal
bang-bang time of the VCO is 1 ps per cycle.



Receiver Chip 

The block diagram of the receiver chip is shown in Fig. 7.
It consists of the demultiplexer, the phase-locked loop and
clock generator, redundant loss-of-signal (LOS) detectors,
and other I/O and supporting circuits such as the power-on
supervisor.

Receiver Demultiplexer

The logic diagram of the demultiplexer is shown in Fig. 8. In
addition to providing the serial-to-parallel conversion of the



Fig. 8. Receiver demultiplexer.






input bit stream, it also detects the comma character
(0011111xxx) for proper frame alignment, as required by the
Fibre Channel standard. Once the comma character is de-
tected, a reset signal is sent to the clock generator for syn-
chronization.
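A software analogy for comma-based alignment is shown below. It searches a serial bit string for the seven-bit comma prefix quoted in the text and then slices the stream into 10-bit characters from that point. The function names and the use of text strings for bits are choices made for this sketch; the chip performs the equivalent operation in hardware.

```python
COMMA_PREFIX = "0011111"   # leading bits of the comma pattern 0011111xxx


def find_comma(bits: str) -> int:
    """Return the index of the first comma prefix, or -1 if none is found."""
    return bits.find(COMMA_PREFIX)


def align(bits: str, width: int = 10):
    """Slice the stream into characters starting at the first comma."""
    start = find_comma(bits)
    if start < 0:
        return []
    last = len(bits) - width + 1
    return [bits[i:i + width] for i in range(start, last, width)]


if __name__ == "__main__":
    # Three stray bits, then a K28.5 character, then one data character.
    stream = "101" + "0011111010" + "1100110011"
    print(find_comma(stream), align(stream))
```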

To minimize power consumption, an interlacing method of
demultiplexing is used in the receiver chip. The high-speed
data stream is first deciphered into two streams at half the
rate, and these are loaded into two banks of shift registers.
The parallel data in the shift registers is then clocked into
the output flip-flops at the frame rate. An extra bank of
latches is added for data byte 0, which enables the ping-
pong mode of operation.

Since there are two possible ways in which the deciphering
can occur, that is, bit one could be in either bank one or
bank two, proper decoding is needed to reassemble the final
byte pattern. This is accomplished by the selector block
preceding the output flip-flops, and is controlled by how the
comma character is detected within the two banks. When
either case is detected, the reset indicator to the clock gen-
erator is set high. Since this signal is a critical path in the
overall operation of the chip, it is retimed to give the clock
generator more margin to reset. As a result, the data is
delayed by an additional cycle before being loaded out. This
delay is compensated by extending the shift register count
by one.
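The two-bank ambiguity can be modeled in a few lines. In this sketch the serial stream is split into two half-rate banks, and the parity of the position at which the word starts (which the hardware learns from comma detection) decides which bank supplies the first bit. The function name and the explicit `offset` argument are assumptions for illustration; the selector block's actual logic is not given in the article.

```python
def recover_word(stream: str, offset: int, width: int = 20) -> str:
    """Rebuild a word that starts at `offset` from two half-rate banks."""
    bank_one = stream[0::2]   # bits at even stream positions
    bank_two = stream[1::2]   # bits at odd stream positions
    half = width // 2
    base = offset // 2
    if offset % 2 == 0:
        # Word bit one landed in bank one.
        first = bank_one[base:base + half]
        second = bank_two[base:base + half]
    else:
        # Word bit one landed in bank two; bank one is one step behind.
        first = bank_two[base:base + half]
        second = bank_one[base + 1:base + 1 + half]
    return "".join(a + b for a, b in zip(first, second))


if __name__ == "__main__":
    word = "00111110101010101010"
    for offset in (4, 5):
        stream = "1" * offset + word + "0000"
        print(offset, recover_word(stream, offset) == word)
```

Both the even and the odd starting positions recover the same word, which is the job the text assigns to the selector block.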

The data is further delayed before unloading by the anti-
sliver feature (discussed next) in the clock generator where
the clocks are extended. The number of registers is in-
creased in the same manner to compensate.

Receiver Clock Generator and Antisliver Circuit

The logic diagram of the clock generator is shown in Fig. 9.
In a manner similar to its transmitter counterpart, it takes
the VCO output and generates all of the necessary clocks
required by the chip. It includes an antisliver circuit, which
ensures that the frame clock presented to the user has no
"slivers," as explained below.

The core of the generator is a divide-by-5-or-10 counter. To
minimize power, the frame clock goes through a 2,10 scaling
for the 1062.5-Mbit/s mode, and a 2,2,5 scaling for the



Fig. 9. Receiver clock generator.



531.25-Mbit/s mode. However, RBCLK requires a clock at the
frame rate that must have a 50% duty cycle. Since this is not
possible with the natural states within the divide-by-five
counter (2/5 or 3/5), the pulse of the 2/5 count is extended
by one cycle of the high-speed clock, yielding the required
50% duty cycle.

When the reset signal is applied, the counter is forced to a
predetermined state. Because the previous state of the
counter is random, the frame clock may contain short pulses
or slivers, which could cause problems for the user. The
antisliver circuit continuously monitors the counter for this
condition and masks these bursts as they occur. The logic
accommodates both the 531.25-Mbit/s mode and the
1062.5-Mbit/s mode.

Receiver Phase-Locked Loop 

The phase-locked loop of the receiver uses the same basic
configuration as the transmitter phase-locked loop, with the
addition of a phase detector for NRZ data. The design of this
detector is identical to the detector used in another propri-
etary HP IC [4], with the exception that the falling edges are
ignored. This is to eliminate any effect of the excess jitter of
the falling edge, which arises from the self-pulsing CD
lasers. The lock-to-reference (-LCKREF) input enables the
user to activate the frequency detector for initial frequency
acquisition.

Receiver Loss-of-Signal (LOS) Detector 

With major concerns for eye safety, the Fibre Channel stan-
dard calls for redundant LOS detectors within the optical
module to ensure a robust system. Two LOS detectors are
incorporated in the receiver IC, and are provided as outputs
to the OFC chip. Since the alarm outputs are heavily filtered
within the OFC chip, hysteresis is not necessary in the LOS
design.

The LOS detectors are based on a concept of envelope
detection without the use of a capacitor. One detector is
configured to detect the loss of amplitude resulting from a
lack of received signal. The second detects the condition in
which the differential inputs are static, indicating a fault in
the optical receiver. Both LOS detectors are further digitally
filtered, with one driven by the reference clock and the
other by an internal clock. This further ensures the reliabil-
ity of the fault detection system for maximum safety. The
triggered threshold is preset to 25 mV and can be adjusted
with an external resistor, as shown in Fig. 10.

The receiver front-end sensitivity is well below the nominal
LOS threshold of 25 mV. Fig. 11 shows the bit error rate
(BER) as a function of the input differential signal. The BER
is basically zero for signals 6 mV and above. Because it is
impractical to perform actual tests for BER as low as 10^-20
(~3000 years), one can use the plot to extrapolate the BER
for the incoming signal amplitude. Tests have been run for
weeks without a single error.
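The impracticality of measuring very low error rates directly follows from simple arithmetic: the average time between errors is 1/(BER × bit rate). The sketch below evaluates this at the chipset's 1062.5-Mbit/s rate. The function names are illustrative, and the choice of example BER values is an assumption; a figure of roughly 3000 years corresponds to a BER of about 10^-20 at this rate.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
BIT_RATE = 1062.5e6   # bits per second, full-speed mode


def seconds_per_error(ber: float, bit_rate: float = BIT_RATE) -> float:
    """Average time between bit errors for a given bit error rate."""
    return 1.0 / (ber * bit_rate)


def years_per_error(ber: float, bit_rate: float = BIT_RATE) -> float:
    """Same interval expressed in years."""
    return seconds_per_error(ber, bit_rate) / SECONDS_PER_YEAR


if __name__ == "__main__":
    for ber in (1e-12, 1e-15, 1e-20):
        print(f"BER {ber:g}: {seconds_per_error(ber):.3g} s "
              f"({years_per_error(ber):.3g} years)")
```

At a BER of 10^-12 an error is expected roughly every 16 minutes, which is easy to measure; at 10^-20 the interval is on the order of three thousand years, so extrapolation from the measured curve is the only practical option.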

Transceiver Chip

One major target application for Fibre Channel is disk ar-
rays. This application demands a much lower-power and
lower-cost solution than the chipset offers. The new HP







Fig. 10. LOS threshold as a function of DR_REF.

HDMP-1526 transceiver is a 10-bit, 1062.5-Mbaud transceiver
designed for this Fibre Channel arbitrated loop (FC-AL)
market. It is a descendant of the HDMP-1512/HDMP-1514
20-bit, 1062.5/531.25-Mbaud transmitter/receiver chipset.
The Fibre Channel chipset has many functions not needed in
the FC-AL transceiver chip, such as optical interface blocks.
Fig. 12 shows the block diagram of the transceiver. The re-
duction in functions and the change to a 10-bit bus allowed
the integration of both transmitter and receiver functions
onto a single die, using a proprietary HP device array. The
transceiver uses a 10-bit parallel interface running at
106.25 Mbaud instead of a 20-bit parallel interface running
at 53.125 Mbaud.

Deletions on the transmitter side include the ac laser driver
and its supporting dc bias circuitry and the comma genera-
tor function. Deletions on the receiver side include the pow-
er-on/reset circuitry, the loss-of-signal circuitry, and the
cable equalizer function. Deleted functions common to both
the transmitter and receiver include the speed selector and


Fig. 11. BER as a function of data input amplitude.

the ping-pong selector. The loopback ports were also de-
leted because an internal connection is possible with a
single-chip design. This leaves just one high-speed output
port and one high-speed input port. Other timing changes
were implemented to fit specific customer needs.

In the single-chip design, select inputs and clocks are shared
between the transmitter and receiver. This lowers the power
requirements, reduces the number of pins, and makes pos-
sible a smaller chip size if a custom layout is done at a later
date. The transceiver, with its 1.8-watt total power dissipa-
tion (compared to 3 watts for the chipset), is packaged in a
single 64-pin 14-by-14-mm quad flat pack.

Isolating the two independent phase-locked loops within a
3.54-mm-by-3.54-mm area presented the biggest challenge
during the layout of the chip. The transmitter and receiver
phase-locked loops are placed at opposite corners to mini-
mize cross talk. Various portions of the chip are isolated by
using separate power supplies and bandgap references.



Fig. 12. Transceiver block diagram.







Fig. 13. Transceiver die (proprietary HP array).

Although much of the chip and block-level layout was re-
done starting from the 20-bit chipset, the proprietary HP
array enabled us to complete the transceiver design very
quickly. Fig. 13 shows a photograph of the FC-AL 10-bit
transceiver die implemented on the array.



Summary

A two-chip gigabit chipset conforming to the FC-0 specifica-
tion has been fabricated. The speed of the chipset is user-
selectable at either 1062.5 Mbaud or 531.25 Mbaud. The
transmitter integrates a high-speed laser driver capable of
driving either 780-nm CD lasers or 1300-nm lasers. The re-
ceiver has redundant loss-of-signal detectors for eye safety.
The chipset runs on a single +5Vdc supply and has TTL data
and control interfaces. Implementation using a proprietary
HP device array allowed a quick design cycle to produce a
10-bit single-chip transceiver for the FC-AL market.

Acknowledgments

The authors would like to thank the following people for
their invaluable help and support in the development of this
chipset: Han-ling Beh, Virginia Brown, Chee Chow, Mike
Dix, Kamal Elahi, Chris Ocampo, Pat Petruno, Bruce Poole,
and Ron Whitetree. Many thanks to David Siljenberg at IBM
for his Fibre Channel expertise. We would also like to thank
the people from the Integrated Circuit Business Division's
HP25 process and support teams for their great models,
tools, and fabrication.

References 

1. C.S. Yen, R. Walker, P. Petruno, C. Stout, B. Lai, and W. McFarland,
"G-Link: A Chipset for Gigabit-Rate Data Communication," Hewlett-
Packard Journal, Vol. 43, no. 5, October 1992, pp. 103-116.

2. ANSI X3.230 Fibre Channel FC-0 Standard.

3. P.R. Gray and R.G. Meyer, Analysis and Design of Analog Inte-
grated Circuits, Third Edition, Wiley, 1992.

4. B. Lai and R. Walker, "A Monolithic 622-Mb/s Clock Extraction
Data Retiming Circuit," Proceedings of the ISSCC, 1991, pp. 144-145.






Applying the Code Inspection Process 
to Hardware Descriptions 

The code inspection process from the software world has been applied to
Verilog HDL (hardware description language) code. This paper explains the
code inspection process and the roles and responsibilities of the
participants. It explores the special challenges of inspecting HDL, the
types of findings made, and the lessons learned from using the process
for a year.



by Joseph J. Gilray



The primary goal of the code inspection process is to maxi-
mize the quality of the code produced by an organization. A
secondary benefit of the process is that it allows members
of the development teams to share best practices. The code
inspection process revolves around a formal inspection meet-
ing. The process calls for the development of operational
definitions, planning, a technical overview, preparation for
the meeting, rework after the meeting, and follow-up. Fig. 1
illustrates the relationships between the steps. The steps
themselves are described in the sections that follow. As
shown in Fig. 1, the operational definitions affect all stages
in the inspection process. Between some of the stages in






the inspection process, decisions on whether or not to continue
need to be made. These decisions are indicated on the figure
by "Proceed?".

The code inspection process as implemented at the HP Inte-
grated Circuits Business Division in Corvallis, Oregon (ICBD
Corvallis) contains several roles: process manager, modera-
tor, author, paraphraser (reader), scribe, and inspector.
There is only one permanent role, that of inspection process
manager. The remaining roles are filled for each inspection.
The subsections below call out the general responsibilities
and duties of each role. Specific tasks are called out in the
description of each process stage later in the paper. HP's
software quality engineering department has published
checklists for each role that can be very useful when getting
started with the process.

Process Manager. Ensures that best practices are spread
among the designers of the organization or project. Tasks
include developing and publishing operational definitions
(described in the next section), disseminating best practices
and common defects, and acting as an advocate for the HDL
code inspection process. This last item cannot be over-
emphasized. The process manager must ensure that priority
is given to inspections even in the face of mounting sched-
ule pressure on the design team. It is also important that the
process manager make clear that the specific results (de-
fects found) of the inspections will not be made available to
management or any other party. The inspection process can
only succeed in an environment where the members of the
design team feel secure in opening their code to review. So
that management sees the value of the process, the process
manager should keep general results of the overall inspec-
tion process such as the number and type of defects found,
the time spent, lines of code inspected, and most important,
best practices shared. These process statistics are very use-
ful, but the grass roots support that develops for the inspec-
tion process will be the real indicator of its value.

Moderator. Manages each step of the process for a given in-
spection. Ensures that participants are prepared and that
requirements are met.



Fig. 1. The code inspection process.






Author. Prepares the HDL for inspection. Creates supplemen-
tary documentation (such as block diagrams) as necessary
to explain the purpose of the code. Open to suggestions and
defects. Reworks the HDL as necessary.

Paraphraser. Familiar with guidelines and best practices.
Able to explain the HDL code during the inspection meeting.

Scribe. Logs defects and enhancements found during the
meeting.

Inspector. Reads and understands the HDL. Notes any de-
fects, comments, or enhancements before the meeting.
Every person involved in the meeting participates as an
inspector.

Development of Operational Definitions 

An operational definition is simply a standard. Before an
inspection takes place a core set of operational definitions
should be in place and recognized by the design team. They
are developed from conventions, guidelines, industry stan-
dards, and recognized best practices. For HDL code inspec-
tions at ICBD Corvallis, we adopted the simplest set of oper-
ational definitions that we felt were adequate to guide the
process:

• Coding style standards. Although no explicit HDL coding
standard was selected, we developed a standard HDL mod-
ule header (Fig. 2).

• Definition of a defect. We defined a defect as any deviation
from the module specification as presented in the technical
overview meeting (see below) and the HDL module header.

• Defect severity codes. We applied a simple defect severity
scale based on HP's internal Defect Tracking System (DTS),
as shown in Table I.

Table I
Defect Severities

Name      ID  Description

Critical  9   Defect will lead to unworkable or grossly
              inefficient design.

Serious   7   Defect will lead to a large deviation from
              the specification or to a design that is un-
              reliable or very inefficient.

Major     5   Defect will lead to a deviation from
              the specification or to a design that is
              inefficient.

Minor     3   Defect will lead to a minor deviation from
              the specification or to a design that is
              slightly inefficient. Also used when code
              is in serious need of comments to be
              maintainable.

WIBNI     1   "Wouldn't it be nice if...?" This ID is used
              for enhancement requests, which are typi-
              cally changes in coding style or requests
              for clarifying comments in the code.

• Defect logging standards. We started out using inspection
data summary sheets provided by HP's software quality
engineering department, but after a few inspections we
found that an open-format inspection process and defect
logs worked better.



// Filename
// Module name(s)
// Author name(s)
// Revision log
// File description (why are these modules grouped together)
// . . .
// Module name (for each module)
// Module description
// Signal descriptions (these include all HDL signals, including wires)
// For each signal specify:
// - type
// - purpose
// - values/states description
// - invariants (such as tristate nodes that are always driven)
// - special loading conditions
// - value at reset
// - overflow/wraparound conditions (e.g., for counters)

Fig. 2. Standard Verilog HDL module header adopted for code
inspections.

• Target-based best practices. ICBD Corvallis developed a set
of Verilog HDL coding guidelines to ensure reliable, high-
quality synthesis results. These guidelines include sections
on clocking strategies, block structure, latches and registers,
state machines, design for test, ensuring consistent behav-
ioral and structural simulation results, and issues specific to
Synopsys synthesis tools, which are used extensively by HP.
This document provided valuable input to the HDL code
inspection process and itself benefitted from the practices
shared during the inspections.

• Inspection entry criteria. The inspection entry criteria were
that the HDL had to be functionally correct in behavioral
simulation and had to be of small-to-moderate size (100 to
700 noncomment Verilog HDL source statements).

• Inspection exit criterion. We did not develop a formal in-
spection exit criterion. Instead, the moderator was given the
responsibility of ensuring that rework was satisfactorily
completed for each piece of HDL inspected.
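The size part of the entry criteria lends itself to a quick automated check. The sketch below is one possible approach and not the tool used at ICBD Corvallis: it strips Verilog `//` and `/* */` comments and counts the remaining non-blank lines as a proxy for noncomment source statements. The function names are hypothetical, and the regular expressions do not handle comment markers inside string literals.

```python
import re


def count_noncomment_lines(verilog_source: str) -> int:
    """Count non-blank lines left after removing Verilog comments."""
    # Remove /* ... */ block comments first, then // line comments.
    without_block = re.sub(r"/\*.*?\*/", "", verilog_source, flags=re.DOTALL)
    without_line = re.sub(r"//[^\n]*", "", without_block)
    return sum(1 for line in without_line.splitlines() if line.strip())


def meets_size_criterion(verilog_source: str,
                         low: int = 100, high: int = 700) -> bool:
    """Check the 100-to-700 size window quoted in the entry criteria."""
    return low <= count_noncomment_lines(verilog_source) <= high
```

A line count only approximates a statement count, so a module near either limit would still need the judgment of the author and moderator.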

Planning 

When a designer feels that a piece of HDL code is a good
candidate for inspection, the designer asks another designer
to act as moderator. Together they review the HDL to be
inspected to ensure that it meets the entry criteria, espe-
cially that the amount of HDL to be inspected is appropriate.
In addition, they review any supplementary documentation
such as module specifications or block diagrams and dis-
cuss what will need to be presented at the technical over-
view meeting. The moderator, with help from the author,
assembles the rest of the inspection team: a paraphraser
(reader), a scribe, and up to three additional inspectors. It is
the moderator's responsibility to schedule the technical
overview meeting and to ensure that the inspection team
members are prepared to meet their responsibilities. The
moderator should treat the meetings and preparation as a
very important requirement for each participant. Every per-
son involved must be prepared; at a code inspection, no
one is just an observer.

Technical Overview

The technical overview meeting should last no more than
UQ minutes. Its primary purpose is to allow the author to
outline the module(s) to be inspected and to answer ques-
tions. The roles are formally assigned during this meeting





and the moderator should ensure that all participants under-
stand the roles assigned to them. If there are inexperienced
inspection team members, the moderator should take time
to explain the operational definitions and to pass out re-
sponsibility checklists for each role. Finally, any supplemen-
tary documentation and the HDL code itself are distributed
to the team. The code should be printed with line numbers
so that during the inspection meeting all team members can
more easily follow the discussion.

Preparation 

Each member of the inspection team should spend from two
to four hours reading over the HDL. Team members should
mark possible defects on their copies of the code. Team
members should freely discuss the code among themselves
but not in a wider context, to protect the privacy of the
author. The team should be given at least a week to look
over the HDL. During this time the moderator should schedule
the inspection meeting. Before the meeting the moderator
should ensure that all team members are prepared and
can participate in the meeting before allowing the inspection
to proceed.

Inspection Meeting 

The inspection meeting is the heart of the process. The moderator
must reserve a quiet room for a sufficient amount of
time. Typically inspection meetings take from two to three
hours. The moderator is also responsible for keeping the
meeting on track so the code can be completely inspected in
the time allowed. To start the meeting the scribe should record
the amount of preparation time required of each participant.
The paraphraser should announce the order in which
the code will be inspected. Typically this is top-down or
bottom-up. The paraphraser explains each block of code and
allows time for each inspector to discuss possible defects or
enhancements to that code.

The goals of the meeting may vary somewhat from organization
to organization, but typically the major goals are to find
defects in the code under inspection and to share best practices
among the members of the design or coding team. In
our process, we encouraged discussion of any defect or
enhancement. Although this does not strictly adhere to the
traditional software inspection process, we felt the benefits
(improved coding, simulation, and synthesis practices) justified
the time spent.

The moderator must ensure that any defect or enhancement
is recorded by the scribe and that the inspection team agrees
to the severity assigned to each item. To keep the group on
track, the moderator should not allow long discussions of
the severity of any defect. Where no agreement can be
reached, the moderator should assign a severity. If the assignment
of a severity code becomes a stumbling block to
progress in several meetings, a simpler major/minor severity
classification can be adopted as an operational definition.
The moderator should keep track of any best practices that
come up during the meeting that are not already part of the
operational definitions and note any questions raised about
related design processes and tools.

Rework 

After the meeting the scribe gives the defect log to the author
(and only to the author). It is the author's responsibility to
modify the HDL code as appropriate. If the author or the
moderator feels that the HDL should be reinspected, another
meeting can be scheduled (this should be very rare,
and should proceed with a different set of participants in all
roles other than the author).

Follow-up 

After each inspection the moderator should investigate any
questions that were brought up about design processes and
tools, such as simulation and synthesis. The results of the
investigation along with any new best practices should be
published for the design teams. The moderator and process
manager should also review the operational definitions and
update them. Finally, the process manager should update
the overall inspection process statistics.

HDL Issues

When inspecting code written in a high-level software language,
normally there is a single target compiler and platform.
We found that a major difficulty with inspecting code
written in a hardware description language was deciding on
a target on which to focus. HDL is traditionally targeted to
both simulation and synthesis (among a growing list of HDL
source-level tools). We started by trying to inspect HDL
without thinking in terms of a specific target. In theory, it
might be possible to inspect HDL code as an abstract description.
In practice, it was nearly impossible. Both the
expected simulation results and the actual implementation
created by the synthesis process were always on the minds
of the inspectors (see Fig. 3). Furthermore, operational definitions
such as defect severity are invariably developed and
interpreted with reference to a target.

By the time we had done several inspections it was evident
that the most common practices being shared in the meetings
were related to register transfer language (RTL) coding
for synthesis. Since the synthesis tools are not as mature as
either compilers for high-level languages in the software
domain or simulators in the hardware domain, we spent a
good deal of time discussing what structural elements the
synthesis tools would create from the RTL-level HDL code
given a set of constraints and synthesis options. This seemed



Fig. 3. HDL is targeted to both simulation and synthesis.
(Diagram labels: Register Transfer Language Hardware Description
Language; Synthesis: Options, Configuration, Constraints;
Results and Structural Representation.)






natural given the complexity of the synthesis tools. At times
the inspection meetings focused more on the synthesis tools
than on the HDL. When writing HDL for synthesis the types
of complexities involved are more akin to porting a complex
piece of software between frameworks than between compilers.
Therefore, it is inevitable that the inspection meetings
devote a good deal of time to synthesis. As use of very
complex source-level tools such as behavioral synthesis
tools becomes widespread this effect will become more pronounced.
In fact, one benefit of the HDL inspection process
is to share information about the tools used during design
creation. This happens less frequently in a traditional software
inspection where the compiler is less configurable and
better understood. But where the software is targeted at
many platforms, this type of discussion occurs in the software
domain as well.

Another difference we found between HDL code inspections
and software inspections was that often there were questions
that couldn't be satisfactorily answered during the inspection,
such as "What will the synthesizer produce from the following
code (e.g., mixed addition and subtraction of registers
with differing widths)?" It was up to the moderator to follow
up with the author (or another inspector) on questions that
couldn't be answered in the meeting and to write up a response
for the design team and for possible inclusion in the
best practices guidelines.

Table II indicates the kinds of topics that came up during the
inspections and their approximate frequency.



Table II
HDL Inspection Topics

Frequency   Topic

35%         HDL coding style, standards, and guidelines
            (e.g., when to use blocking and nonblocking
            assignments, etc.)

30%         Structures produced by synthesis tools (HDL
            compiler, design compiler, finite state machine
            compiler)

10%         Differences between simulation results and
            synthesis results

10%         HDL efficiency considerations (e.g., inference
            of unnecessary latches, use of extra clock
            cycles)

10%         HDL documentation

9%          HDL block structure



As more HDL inspections were performed, the number of
experienced inspectors grew and the guidelines for creation
of HDL for synthesis, which had been created by synthesis
users in the lab, became widely disseminated and discussed.
Again, one of the primary benefits of the HDL code inspection
process is the spread of best practices among the larger
group of designers.

Lessons 

As the use of HDL increased in our lab, we noted a need for
tools to improve the quality of the HDL produced by the
design teams. The lack of HDL source-level tools such as
code complexity analyzers, lint (a syntax checker), and
others led us to choose a less automated approach. Our first
effort at improving the quality of HDL was to develop an
HDL code inspection process based on the inspections done
for software written in high-level languages.

The process that evolved for inspecting HDL in our lab
incorporates elements of both a formal code inspection process
and a structured walkthrough process. Although we
gave importance to the technical overview meeting, it wasn't
always held, especially if inspection team members were
offsite. Furthermore, both the rework and the follow-up
steps were left to the moderator and author and checked
only informally by the process manager.

Early in the adoption of the process we used a set of
responsibility checklists for each role. As time went on we found
that these were not strictly necessary but did engender a
feeling of formality. It is important that the participants take
the process seriously to ensure that the time spent on it is
not wasted.

Over time we came to realize the importance of the technical
overview meeting. If it is impossible for the author to attend
the meeting (we ran into several cases where the author was
from another site and unable to attend a technical overview)
then someone else on the inspection team should take the
author's place for the meeting. In cases where we skipped
the meeting, the preparation time for each participant increased
dramatically. In one case the inspection required 6
to 10 hours of preparation time. Though the code was fairly
long at 900 lines of HDL, this was an unreasonable amount
of time to expect from each reviewer and could have been
reduced by half had there been a one-hour technical overview
held.

In our experience, the most significant benefit of the HDL
inspection process was to spread HDL, simulation, and synthesis
best practices among the design teams. Not only did
the process encourage interaction between various teams
within ICBD, but several design teams in HP entities outside
of ICBD brought code to us for inspection. To ensure that
this benefit is realized it is very important that the process
manager and the moderators take the time to publish the
guidelines that are developed during each inspection. As
designers become proficient at creating HDL and knowledgeable
of synthesis and simulation best practices, and as
HDL coding guidelines become well-established in an organization,
the need to do inspections to spread best practices
decreases.

We found relatively few major defects in the HDL code that
was inspected, probably because the code was all at the RTL
level and simulated and synthesized before inspection. Studies
have indicated that the inspection process gives the best
results when applied at a high level of abstraction. I contend
that we will find more defects if we apply the process to
module specifications or to behavioral HDL. If the target
chosen is complex (as behavioral synthesis tools currently
are) the tendency for the process to focus on the tool
instead of the code will also be more pronounced. Even so,
applying the inspection process to higher-level abstractions
may be a logical next step. Doolan wrote, "As people become
aware of the tremendous benefits of the inspection
process, there is an increasing desire to apply it to other






software items, such as user documentation ... inspection
breeds inspection."²

Summary 

Reference 3 describes one ICBD Corvallis project that used
the HDL inspection process (however, inspections are not
discussed in reference 3).

The code inspection process can be applied successfully to
hardware descriptions if the following conditions are met:

• A simple set of operational definitions is developed for the
process.

• Engineers are willing to open their code to inspection and
the process is viewed by the design community as beneficial
and important.

• Management gives project teams adequate time to perform
inspections.

• Best practices and guidelines are recorded and updated.

For project teams just starting to use hardware description
languages in the design process, code inspections can play a
vital role in ensuring high-quality HDL. At ICBD Corvallis,
we found that the inspection process works extremely efficiently
in spreading best practices for HDL coding, simulation,
and synthesis.

References 

1. T. DeMarco, Controlling Software Projects, 1982, pp. 220-232.

2. E.P. Doolan, "Experience with Fagan's Inspection Method,"
Software: Practice and Experience, February 1992, pp. 173-182.

3. J.D. McDougal and W.E. Young, "Shortening the Time to Volume
Production for High-Performance Standard Cell ASICs," Hewlett-
Packard Journal, Vol. 46, no. 1, February 1995, pp. 91-96.






Overview of Code-Domain Power, 
Timing, and Phase Measurements 



Telecommunications Industry Association standards specify various
measurements designed to ensure the compatibility of North American
CDMA (code division multiple access) cellular transmitters and receivers.
This paper is a tutorial overview of the operation of the measurement
algorithms in the HP 83203B CDMA cellular adapter, which is designed to
make the base station transmitter measurements specified in the
standards.

by Raymond A. Birgenheier



In 1994, the Telecommunications Industry Association (TIA)
released the IS-95 and IS-97 standards developed by the TIA
TR45.5 subcommittee. These standards ensure the mobile-station/
base-station compatibility of a dual-mode wideband
spread spectrum system: the North American CDMA (code
division multiple access) cellular telephone system. CDMA
is a class of modulation that uses specialized codes to provide
multiple communication channels in a designated segment
of the electromagnetic spectrum. The TIA IS-95/97
standards specify various measurements that must be made
on CDMA base station and mobile station transmitters and
receivers to ensure their compatibility. The HP 83203B
CDMA cellular adapter for the HP 8921A Option 600 cell site
test system is designed to make the base station transmitter
measurements specified in the standards. The HP 83203B
algorithms provide accurate measurements of code-domain
power, time, frequency, and phase. This paper is a tutorial
overview of the operation of the measurement algorithms in
the HP 83203B.

The HP 83203B measurement algorithms provide a characterization
of the code-domain channels of a CDMA base
station transmitter. One of the measurements, called code-domain
power, provides the distribution of power in the
code channels. This measurement can be used to verify that
the various channels are at expected power levels and to
determine when one code channel is leaking energy into the
other code channels. The crosscoupling of code channels
can occur for many reasons. One reason is a time misalignment
of the channels, which would negate the orthogonal
relationship among code channels. Another reason may be
the impairment of the signals caused by nonideal or malfunctioning
components in the transmitter. To determine the
quality of the transmitter signal, a waveform quality factor,
ρ, is measured. It is the amount of transmitter signal energy
that correlates with an ideal reference signal when only the
pilot channel is transmitted.



Another set of measurements, called code-domain timing
and code-domain phase, determine how well-aligned the
code channels are in time and in phase. The parameters
measured are time offsets and phase offsets of active code
channels relative to the pilot channel (code channel 0).

To make these measurements to the precision specified in
the IS-97 standard, it is necessary to establish the time origin
and the carrier frequency of the signal to be measured. The
HP 83203B provides these measurements. Another measurement
that may be useful when diagnosing the causes of
poor transmitter signal quality is the carrier feedthrough in
the transmitter signal. The effect of carrier feedthrough will
also be seen when measuring code-domain power.

This paper presents (1) the general concepts of CDMA signals
and measurements, (2) the signal flow of the measurement
algorithms, (3) the specifications from the IS-97 standard
and performance predictions for the measurement
algorithms based on mathematical modeling and simulations,
and (4) some typical results of measurements made
with the HP 83203B.

CDMA Operation 

The channel structure for a CDMA base station transmitter
is shown in Fig. 1. There are 64 code channels, corresponding
to 64 Walsh functions, each 64 chips long.* To see how
the Walsh functions provide the channelization, we will consider
a hypothetical example of four code channels produced
by the four orthogonal Walsh functions shown in
Fig. 2. The sums shown in Fig. 1 are modulo-2, as defined in
Table I. They are appropriate when a 0,1 representation is
used for binary numbers and are equivalent to ordinary multiplication
when a 1,-1 representation is used. The Walsh
functions use nonreturn-to-zero (NRZ) values of 1 and -1 to
represent binary numbers.

* The chip interval is the clock period of the spreading code used in a spread-spectrum system.
In this paper, a chip corresponds to one binary digit of the pilot pseudonoise sequences shown
in Fig. 1.






Fig. 1. Forward CDMA (base station transmitter) channel structure.
(Diagram labels: pilot channel (all zeros) with Walsh function w0;
code channel 1 through code channel 63, inputs at 19.2 kbits/s, each
with its Walsh function; I-channel and Q-channel pilot pseudonoise
sequences at 1.2288 Mchips/s; carriers cos(2πf_c t) and -sin(2πf_c t).)





Table I
Modulo-2 Sum (XOR)

⊕    0    1

0    0    1

1    1    0



The Walsh functions are said to be orthogonal because the
inner product of $w_i(t)$ and $w_j(t)$ is:

$$\int_0^4 w_i(t)\,w_j(t)\,dt = \begin{cases} 4, & i = j \\ 0, & i \neq j \end{cases} \qquad (1)$$

that is, the inner product of two distinct Walsh functions is
zero.

The orthogonality property produces the channelization, as
we can see by considering the transmission of a binary digit
(bit) that is four chip intervals long on channel 1. If the bit
is represented by ±1, then at the transmitter and, ideally, at
the receiver the bit is represented by $\pm w_1(t)$. At the receiver,
an operation equivalent to equation 1 is performed
on $\pm w_1(t)\,w_i(t)$ for each channel for i = 0, 1, 2, 3. This operation
produces the result:

$$\int_0^4 \pm w_1(t)\,w_i(t)\,dt = \begin{cases} \pm 4, & i = 1 \\ 0, & i \neq 1 \end{cases} \qquad (2)$$

Therefore, we see that the bit can be detected on channel 1,
but it does not appear on channels 0, 2, or 3.

Fig. 2. Four orthogonal Walsh functions. (Axes: values of 1 and -1
versus time t in chip intervals, 0 to 4.)

The 64 Walsh functions used for the channelization shown in
Fig. 1 are represented by 64-bit words that are rows (or columns)
of a 64 × 64 Hadamard matrix. The Hadamard matrix
is orthogonal (i.e., rows or columns are orthogonal) and can
be generated by the following simple algorithm:

• A 2 × 2 Hadamard matrix is defined as:

$$H_2 = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \qquad (3)$$

• A 4 × 4 Hadamard matrix is generated:

$$H_4 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix} \qquad (4)$$

• In general, a Hadamard matrix $H_{2n}$ is generated from a
Hadamard matrix $H_n$ by:

$$H_{2n} = \begin{bmatrix} H_n & H_n \\ H_n & \overline{H}_n \end{bmatrix} \qquad (5)$$

where $\overline{H}_n$ is the binary complement of $H_n$.
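
The recursive construction and the channelization it provides can be sketched in a few lines. This is an illustration only, not the instrument's algorithm: the function names are hypothetical, the lower-right block is taken to be the binary complement of the smaller matrix, and the four-chip example stands in for the 64-chip words.

```python
def hadamard(n):
    """Return the n x n Hadamard matrix in 0,1 form (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    if n == 1:
        return [[0]]
    h = hadamard(n // 2)
    top = [row + row for row in h]                          # [H  H]
    bottom = [row + [1 - b for b in row] for row in h]      # [H  complement of H]
    return top + bottom


def to_nrz(row):
    """Map binary 0 -> +1 and binary 1 -> -1."""
    return [1 - 2 * b for b in row]


# Four-chip Walsh words w0..w3 in the 1,-1 representation.
walsh = [to_nrz(row) for row in hadamard(4)]


def despread(received, i):
    """Discrete counterpart of the inner product: correlate with Walsh word i."""
    return sum(r * w for r, w in zip(received, walsh[i]))


bit = -1                                     # one data bit sent on channel 1
received = [bit * chip for chip in walsh[1]]
outputs = [despread(received, i) for i in range(4)]
```

Here `outputs` is nonzero only for channel 1, where it equals four times the transmitted bit. Calling `hadamard(64)` gives the 64 words used for the 64 code channels.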



The inner product of two rows of $H_n$ is obtained by the modulo-2
summing of the two rows, element by element, and
counting the difference between the number of 0s and 1s,
where the modulo-2 sum is the XOR operation defined in
Table I. For example, to obtain the inner product of rows 1
and 2 of $H_4$, we perform the following operation:

    Row 1:           0   1   0   1
    Row 2:           0   0   1   1
    Modulo-2 sum:    0   1   1   0                    (6a)

    Inner product = number of 0s
    minus number of 1s = 0

If a 1,-1 representation is used for the binary numbers, then
the inner product given by equation 6a is simply:

    Row 1:           1  -1   1  -1
    Row 2:           1   1  -1  -1
    Product:         1  -1  -1   1                    (6b)

    Inner product = sum = 0.
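
The equivalence of the two forms of the inner product can be checked directly. The sketch below is an illustration with hypothetical function names. It computes the inner product once with the XOR-and-count rule and once with ordinary multiplication after mapping 0 to +1 and 1 to -1.

```python
H4 = [
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]


def inner_product_binary(row_a, row_b):
    """0,1 representation: XOR element by element, then count 0s minus 1s."""
    xor = [a ^ b for a, b in zip(row_a, row_b)]
    return xor.count(0) - xor.count(1)


def inner_product_nrz(row_a, row_b):
    """1,-1 representation: multiply element by element, then sum."""
    return sum((1 - 2 * a) * (1 - 2 * b) for a, b in zip(row_a, row_b))


# Inner products of every pair of rows, computed both ways.
binary_results = [[inner_product_binary(a, b) for b in H4] for a in H4]
nrz_results = [[inner_product_nrz(a, b) for b in H4] for a in H4]
```

Both tables come out as 4 on the diagonal and 0 elsewhere, which is the orthogonality property stated for the Walsh functions.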



Fig. 3 shows an example of the pseudonoise encoding
shown in Fig. 1 for code channel 1. The input bits, denoted
by $d_1$, are added (modulo-2) to the Walsh function $w_1$ and
then to the I-channel and Q-channel pseudonoise sequences
$i_{pn}$ and $q_{pn}$. The resulting modulo-2 sums are converted to
±1 for $I_k$ and $Q_k$, where +1 represents binary 0 and -1 represents
binary 1. The discrete-time signals $I_k$ and $Q_k$ provide
the inputs to the transmit filters. The outputs of these filters
are the superposition of pulses centered at discrete times $t_k$,
k = ..., 0, 1, 2, ..., as illustrated in Fig. 4.
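
The encoding steps can be sketched as follows. This is a minimal illustration, assuming a four-chip Walsh word and short, hypothetical pseudonoise sequences in place of the 64-chip words and full-length pilot sequences. The function names are not from the standard.

```python
def to_nrz(bit):
    """Map binary 0 -> +1 and binary 1 -> -1."""
    return 1 - 2 * bit


def pn_encode(data_bits, walsh_bits, i_pn, q_pn):
    """Spread data bits (one per Walsh period) and return the +/-1 chips (I_k, Q_k)."""
    n = len(walsh_bits)
    i_chips, q_chips = [], []
    for k, (i_bit, q_bit) in enumerate(zip(i_pn, q_pn)):
        spread = data_bits[k // n] ^ walsh_bits[k % n]   # modulo-2 sum with the Walsh word
        i_chips.append(to_nrz(spread ^ i_bit))           # modulo-2 sum with I-channel PN
        q_chips.append(to_nrz(spread ^ q_bit))           # modulo-2 sum with Q-channel PN
    return i_chips, q_chips


# Hypothetical short sequences in the 0,1 representation.
data = [0, 1]                       # two data bits, one per Walsh period
w1 = [0, 1, 0, 1]                   # four-chip Walsh word
i_pn = [1, 0, 1, 0, 1, 0, 0, 1]
q_pn = [1, 0, 0, 1, 1, 1, 1, 0]

I_k, Q_k = pn_encode(data, w1, i_pn, q_pn)
```

Because the modulo-2 sum in the 0,1 representation corresponds to multiplication in the 1,-1 representation, each output chip equals the product of the mapped data bit, Walsh chip, and pseudonoise chip.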



Fig. 3. Pseudonoise encoding. (Rows shown: input bits $d_1$, each
64 chips long; Walsh function $w_1$; I-channel pseudonoise sequence
$i_{pn}$; Q-channel pseudonoise sequence $q_{pn}$; resulting $I_k$ and $Q_k$.)

Fig. 4. Transmit filter output.

If the pulse for $I_k$ or $Q_k$ equals zero when $t = t_i$, $i \neq k$, then
the pulses at the outputs of the transmit filters do not interfere
with each other at discrete times $t_k$, k = ..., 0, 1, 2, ... and
we say the transmit filters introduce zero intersymbol
interference.
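
The zero intersymbol interference condition can be illustrated with a pulse that is 1 at its own sample time and 0 at every other sample time. The sketch below assumes an ideal sinc pulse with unit chip spacing purely for illustration. It is not the transmit filter of any standard, and the names are hypothetical.

```python
import math


def sinc_pulse(t):
    """Ideal Nyquist pulse: 1 at t = 0 and 0 at every other integer chip time."""
    if t == 0:
        return 1.0
    return math.sin(math.pi * t) / (math.pi * t)


def filter_output(symbols, t):
    """Superposition of pulses, one centered at each discrete time k = 0, 1, 2, ..."""
    return sum(s * sinc_pulse(t - k) for k, s in enumerate(symbols))


symbols = [1, -1, -1, 1, -1]
# At the sample times each output equals its own symbol: no interference.
at_sample_times = [filter_output(symbols, k) for k in range(len(symbols))]
# Between sample times the neighbouring pulses overlap and the output differs.
between_samples = filter_output(symbols, 0.5)
```

At the integer sample times the neighbouring pulses contribute nothing (to within floating-point rounding), so the transmitted values are recovered exactly. Between sample times the pulses overlap freely.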

The transmit filters illustrated in Fig. 4 introduce zero
intersymbol interference. However, the transmit filter specified
in the IS-95 standard does introduce intersymbol interference.
Moreover, the base station transmitter specified in the
standard must incorporate an all-pass phase preequalizer,
which produces an asymmetric transmitter pulse response.

The reason for the I-Q structure shown in Fig. 1 will become
clearer after we consider code-domain signals.

Code-Domain Signals (Forward Link) 

Any sinusoidal carrier with amplitude and phase modulation
can be written mathematically as:

$$X(t) = A(t)\cos[\omega_c t + \Phi(t)] \qquad (7)$$

where $\omega_c = 2\pi f_c$ ($f_c$ is the carrier frequency in Hz), $A(t)$ is
the instantaneous amplitude, and $\Phi(t)$ is the instantaneous
phase.

Using the trigonometric identity $\cos(\theta+\varphi) = \cos\theta\cos\varphi - \sin\theta\sin\varphi$,
equation 7 can be rewritten as:

$$X(t) = A(t)\cos\Phi(t)\cos\omega_c t - A(t)\sin\Phi(t)\sin\omega_c t = I(t)\cos\omega_c t - Q(t)\sin\omega_c t \qquad (8)$$

where the in-phase component of the signal (the component
multiplying the carrier $\cos\omega_c t$) is:

$$I(t) = A(t)\cos\Phi(t), \qquad (9)$$

and the quadrature component (the component multiplying
the quadrature carrier $-\sin\omega_c t$) is:

$$Q(t) = A(t)\sin\Phi(t). \qquad (10)$$

Using Euler's identity, $e^{j\theta} = \exp(j\theta) = \cos\theta + j\sin\theta$, we can
write:

$$I(t) + jQ(t) = A(t)e^{j\Phi(t)}. \qquad (11)$$
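
Equations 7 through 11 can be checked numerically. The sketch below uses arbitrary, hypothetical values for the amplitude, phase, carrier frequency, and time. It confirms that the I-Q form reproduces the modulated carrier and that I + jQ equals the complex exponential.

```python
import cmath
import math


def iq_components(amplitude, phase):
    """Equations 9 and 10: in-phase and quadrature components."""
    return amplitude * math.cos(phase), amplitude * math.sin(phase)


def carrier(amplitude, phase, f_c, t):
    """Equation 7: amplitude- and phase-modulated carrier."""
    return amplitude * math.cos(2 * math.pi * f_c * t + phase)


def carrier_from_iq(i, q, f_c, t):
    """Equation 8: the same signal written with I and Q."""
    w_c = 2 * math.pi * f_c
    return i * math.cos(w_c * t) - q * math.sin(w_c * t)


# Arbitrary illustrative values (not taken from the paper).
A, phi, f_c, t = 1.5, 0.7, 10.0, 0.0123
I, Q = iq_components(A, phi)
direct = carrier(A, phi, f_c, t)
via_iq = carrier_from_iq(I, Q, f_c, t)
envelope = complex(I, Q)                 # I + jQ
polar = A * cmath.exp(1j * phi)          # A e^{j phi}, equation 11
```

The two carrier values agree, and `envelope` equals `polar`, to within floating-point rounding.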






$I(t) + jQ(t)$ is called the complex envelope of the modulated
carrier and is represented as a rotating phasor as shown in
Fig. 5. The tip of the rotating phasor moves as a function of
time forming the locus referred to as the signal trajectory.

The forward link of the CDMA system uses quadrature
phase-shift keying (QPSK) modulation. First, we will consider
the case in which only the pilot signal is present. In
this case, if no intersymbol interference is introduced by the
transmit filter, the signal trajectory passes through four discrete
points separated by multiples of 90 degrees in the I-Q
plane as shown in Fig. 6. These four points on the I-Q diagram
are referred to as the signal constellation for the QPSK
modulation.

The coordinates of these points represent the four possible
values of a pair of bits. As the signal moves along its trajectory,
the coordinates at discrete time $t_k$ represent the pair of
bits transmitted at this time. The example signal trajectory
presented in Fig. 6 is for the first eight pairs of bits of the
pilot sequences with corresponding times $t_k$, as given in
Table II.



Table II
First 8 Pairs of Bits of Pilot Sequences

k        1    2    3    4    5    6    7    8

i_pn    -1    1   -1    1   -1    1    1   -1

q_pn    -1    1    1   -1   -1   -1   -1    1



Now we will consider a case in which the pilot (code channel 0)
and code channel 1 are transmitted simultaneously.
In this case, the transmitter signal can be represented as:

$$X(t) = A_0(t)\cos[\omega_c t + \Phi_0(t)] + A_1(t)\cos[\omega_c t + \Phi_1(t)], \qquad (12)$$

where $A_0(t)$ and $\Phi_0(t)$ represent the amplitude and phase
modulation introduced by the pilot and $A_1(t)$ and $\Phi_1(t)$ represent
the amplitude and phase modulation introduced by
code channel 1. Using the trigonometric identity $\cos(\theta+\varphi) = \cos\theta\cos\varphi - \sin\theta\sin\varphi$,
we can write equation 12 as:

$$X(t) = [A_0(t)\cos\Phi_0(t) + A_1(t)\cos\Phi_1(t)]\cos(\omega_c t) - [A_0(t)\sin\Phi_0(t) + A_1(t)\sin\Phi_1(t)]\sin(\omega_c t) = I(t)\cos(\omega_c t) - Q(t)\sin(\omega_c t), \qquad (13)$$




Fig. 5. The complex envelope of the modulated carrier is represented
as a rotating phasor. The locus of the tip of the phasor is
called the signal trajectory.

Fig. 6. Example of a signal constellation (points) and a signal
trajectory.

where

$$I(t) = A_0(t)\cos\Phi_0(t) + A_1(t)\cos\Phi_1(t) \qquad (14)$$

and

$$Q(t) = A_0(t)\sin\Phi_0(t) + A_1(t)\sin\Phi_1(t). \qquad (15)$$

From equations 14 and 15, it is clear that since

$$I(t) = I_0(t) + I_1(t) \quad \text{and} \quad Q(t) = Q_0(t) + Q_1(t), \qquad (16)$$

$I(t)$ and $Q(t)$ are simply the superposition of the corresponding
components produced by the pilot and code channel 1.
Therefore, we can superimpose I-Q diagrams.

To simplify the description at this point, we will consider the
code channels produced by four orthogonal Walsh words
each four chips long, as shown in Table III.

Table III
Orthogonal Walsh Words

w0:    1    1    1    1

w1:    1   -1    1   -1

w2:    1    1   -1   -1

w3:    1   -1   -1    1

For illustrative purposes, we will assume that the peak magnitude
$\sqrt{2}\,a_0 = |A_0(t_k)|$ of the pilot (code channel 0) is
$0.8\sqrt{2}$ and the magnitude $\sqrt{2}\,a_1 = |A_1(t_k)|$ of the signal
for code channel 1 is $0.6\sqrt{2}$, so that the root-sum-square of
the pilot and code channel 1 signals is:

$$\sqrt{0.8^2 + 0.6^2} = 1.0. \qquad (17)$$

In this case, the pilot signal has the trajectory shown in
Fig. 6, except that the signal coordinates are (±0.8, ±0.8)
instead of (±1, ±1).

To determine the trajectory produced by code channel 1, we
must consider multiplying Walsh word $w_1$ by data bits. For






our example, we will assume data bits for two Walsh function
intervals: d = 1, -1. We obtain values for $I_1$ and $Q_1$ as
presented in Table IV.









Table IV
Calculation of I1 and Q1

Time                    t1    t2    t3    t4    t5    t6    t7    t8    t9

i_pn                    -1     1    -1     1    -1     1     1    -1

q_pn                    -1     1     1    -1    -1    -1    -1     1

w1                       1    -1     1    -1     1    -1     1    -1

w1 i_pn                 -1    -1    -1    -1    -1    -1     1     1

w1 q_pn                 -1    -1     1     1    -1     1    -1    -1

d1                       1     1     1     1    -1    -1    -1    -1

d1 w1 i_pn              -1    -1    -1    -1     1     1    -1    -1

d1 w1 q_pn              -1    -1     1     1     1    -1     1     1

I1 = a1 d1 w1 i_pn    -0.6  -0.6  -0.6  -0.6   0.6   0.6  -0.6  -0.6  -0.6

Q1 = a1 d1 w1 q_pn    -0.6  -0.6   0.6   0.6   0.6  -0.6   0.6   0.6  -0.6

First, the $i_{pn}$ and $q_{pn}$ sequences are multiplied by Walsh
word $w_1$ = (1 -1 1 -1) repeated every 4 chips. This result
is then multiplied by the data sequence $d_1 = 1$ for the first 4
chips and $d_1 = -1$ for the next 4 chips, and finally, the two
sequences are multiplied by the amplitude $a_1 = 0.6$. Values of
-0.6 were arbitrarily added for time $t_9$ to be used later to
illustrate the effect of time offset. The resulting sequences
for $I_0, Q_0$ and $I_1, Q_1$ are shown in Table V and their I-Q diagrams
are shown in Fig. 7.
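
The arithmetic behind Tables IV and V can be reproduced with a short script. The sketch below uses the pilot chips, Walsh word, data bits, and amplitudes of the example. The variable names are illustrative, and results are rounded to one decimal place only to suppress floating-point noise.

```python
# First eight chips of the pilot sequences in the 1,-1 representation.
i_pn = [-1, 1, -1, 1, -1, 1, 1, -1]
q_pn = [-1, 1, 1, -1, -1, -1, -1, 1]

w1 = [1, -1, 1, -1]      # Walsh word for code channel 1, repeated every 4 chips
d1 = [1, -1]             # one data bit per Walsh function interval
a0, a1 = 0.8, 0.6        # pilot and code channel 1 amplitudes

# Pilot (code channel 0): all-zeros data and Walsh word w0, so only the PN chips remain.
I0 = [a0 * i for i in i_pn]
Q0 = [a0 * q for q in q_pn]

# Code channel 1: amplitude x data bit x Walsh chip x PN chip.
I1 = [a1 * d1[k // 4] * w1[k % 4] * i_pn[k] for k in range(8)]
Q1 = [a1 * d1[k // 4] * w1[k % 4] * q_pn[k] for k in range(8)]

# Superposition of the two channels.
I = [round(x + y, 1) for x, y in zip(I0, I1)]
Q = [round(x + y, 1) for x, y in zip(Q0, Q1)]
```

The composite values take only the magnitudes 1.4 and 0.2, which is why the trajectory of the sum visits an inner and an outer set of points.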







Table V
Superposition of I-Q Sequences

Time     t1     t2     t3     t4     t5     t6     t7     t8

I0      -0.8    0.8   -0.8    0.8   -0.8    0.8    0.8   -0.8

Q0      -0.8    0.8    0.8   -0.8   -0.8   -0.8   -0.8    0.8

I1      -0.6   -0.6   -0.6   -0.6    0.6    0.6   -0.6   -0.6

Q1      -0.6   -0.6    0.6    0.6    0.6   -0.6    0.6    0.6

I       -1.4    0.2   -1.4    0.2   -0.2    1.4    0.2   -1.4

Q       -1.4    0.2    1.4   -0.2   -0.2   -1.4   -0.2    1.4



In the above example, we considered the situation of a
CDMA signal consisting of the pilot and code channel 1 and
showed that we could obtain the I-Q diagram for the composite
signal simply by superimposing the I-Q diagrams for
the individual signals. For our example of two signals, the
two 4-point I-Q diagrams produced an 8-point diagram for
the composite signal. This principle of superposition can be
applied to any number of code channels and provides a convenient
geometric way of constructing and visualizing signals.
For example, if we consider three code channels with
signal amplitudes of $a_0$, $a_1$, and $a_2$, then we obtain an I-Q
diagram with coordinates (x,y) in which x and y take on the
eight values $\pm a_0 \pm a_1 \pm a_2$ to produce a signal constellation with
16 points. We must keep in mind that the above discussion
applies only for the condition of zero intersymbol
interference.
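
The number of constellation points can be counted by enumeration. The sketch below assumes, as in the worked example, that every code channel is spread by the same pair of pilot pseudonoise chips, so that each additional channel contributes the same sign (data bit times Walsh chip) to both I and Q. Amplitudes are given in hypothetical integer units to avoid rounding, and the function name is illustrative.

```python
from itertools import product


def constellation(amplitudes):
    """Return the set of (I, Q) points; amplitudes[0] is the pilot."""
    points = set()
    for i_chip, q_chip in product((1, -1), repeat=2):          # pilot PN chip pair
        for signs in product((1, -1), repeat=len(amplitudes) - 1):
            # Each added channel adds or subtracts its amplitude on both axes.
            level = amplitudes[0] + sum(s * a for s, a in zip(signs, amplitudes[1:]))
            points.add((i_chip * level, q_chip * level))
    return points


two_channels = constellation([8, 6])        # pilot 0.8 and channel 0.6, in tenths
three_channels = constellation([8, 5, 2])   # hypothetical third amplitude
x_values = {x for x, _ in three_channels}
```

Under these assumptions the two-channel case gives 8 points and the three-channel case gives 16 points, with each coordinate taking eight distinct values.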






Fig. 7. Signal constellation and trajectory for (a) pilot channel,
(b) code channel 1, and (c) the sum of the pilot channel and
code channel 1.






Signal Acquisition (Timing and Frequency Estimation) 

To perform the measurements of the CDMA signals, it is
necessary to estimate the precise carrier frequency so that
the signal to be measured can be converted to baseband,
that is, so it can be represented in terms of an I-Q signal
trajectory as discussed above. Furthermore, it is necessary to
determine the timing of the signal to be measured relative to
the zero time reference of the pseudonoise sequences $i_{pn}$
and $q_{pn}$ which are used to spread the spectrum of the transmitter
signal. The estimation of timing and carrier frequency
are discussed in this section.

Suppose that the transmitter signal to be measured has an unknown frequency error $\Delta\omega$, an unknown phase $\theta_0$, and an unknown time delay $\tau_0$, so that after down-conversion to baseband, the signal available for measurement can be represented in the form of equation 7 with $\omega_c$ replaced with $\omega_c + \Delta\omega$, $t$ replaced with $t - \tau_0$, and a phase term $\theta_0$ added. That is, the signal to be measured can be represented as:

$$X(t-\tau_0) = A(t-\tau_0)\cos[(\omega_c+\Delta\omega)(t-\tau_0) + \Phi(t-\tau_0) + \theta_0], \tag{18}$$

which can be written, using the trigonometric identity $\cos(\theta+\varphi) = \cos\theta\cos\varphi - \sin\theta\sin\varphi$, as:

$$X(t-\tau_0) = A(t-\tau_0)\cos[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\cos\omega_c t - A(t-\tau_0)\sin[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\sin\omega_c t. \tag{19}$$

From equation 19, we obtain the in-phase and quadrature components as:

$$I_X(t) = A(t-\tau_0)\cos[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0] \tag{20}$$

and

$$Q_X(t) = A(t-\tau_0)\sin[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]. \tag{21}$$

Using Euler's identity $e^{j\theta} = \exp(j\theta) = \cos\theta + j\sin\theta$, we can write the complex envelope as:

$$Y(t) = I_X(t) + jQ_X(t) = A(t-\tau_0)\exp\{j[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]\}, \tag{22}$$

from which we see that the baseband signal is a rotating phasor with magnitude $A(t-\tau_0)$ and phase $[\Delta\omega t - (\omega_c+\Delta\omega)\tau_0 + \Phi(t-\tau_0) + \theta_0]$ as shown in Fig. 8.

We see that if $\tau_0 \neq 0$ but $\Delta\omega = 0$, then the amplitude $A(t-\tau_0)$ and phase $\Phi(t-\tau_0)$ are delayed versions of $A(t)$ and $\Phi(t)$ and a phase shift of $-\omega_c\tau_0 + \theta_0$ is added. Therefore, the effect of the time delay is simply a rotation of the I-Q diagram by an angle of $-\omega_c\tau_0 + \theta_0$ and a change of $\tau_0$ in the times at which the signal trajectory passes through the constellation points.







When $\Delta\omega \neq 0$, the frequency error adds an additional phase shift of $-\Delta\omega\tau_0$ and a constant-rate phase rotation of $\Delta\omega t$. The result of the constant-rate phase rotation will, in general, be that the signal trajectory will no longer pass through discrete points, so the I-Q diagram will not resemble its counterpart for zero frequency error.

The functions used to estimate $\tau_0$, $\Delta\omega$, and $\theta_0$ can be described by considering a pilot reference signal given as:

$$S(t-\tau_R, \omega_R) = A_0(t-\tau_R)\exp\{j[\omega_R t + \Phi_0(t-\tau_R)]\}, \tag{23}$$

in which $A_0(t)$ and $\Phi_0(t)$ are the instantaneous amplitude and phase of the complex envelope corresponding to the pilot only, $\tau_R$ is a variable time delay, and $\omega_R$ is a variable frequency. Using the observable baseband signal $Y(t)$ given by equation 22 and the reference signal given by equation 23, the correlation function for these two signals is:

$$P(\tau_R, \omega_R) = \sum_k Y(t_k)\,S^*(t_k-\tau_R, \omega_R). \tag{24}$$



The sample interval $t_k - t_{k-1}$ used here is different from that used previously and, in general, would be a fraction of the chip interval. The magnitude of $P(\tau_R, \omega_R)$ could be maximized with respect to $\tau_R$ and $\omega_R$ to determine the estimates $\hat{\tau}_0$ and $\Delta\hat{\omega}$ of $\tau_0$ and $\Delta\omega$. However, a normalized version of the squared magnitude of this function is used to facilitate the search strategy for finding $\tau_0$. The estimate $\hat{\tau}_0$ is found by forming the function

$$\frac{|P(\tau_R, 0)|^2}{\sum_k |S(t_k-\tau_R, 0)|^2 \sum_k |Y(t_k)|^2} \tag{25}$$

and finding the value $\tau_R = \hat{\tau}_0$ for which this function is maximum.
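The search over $\tau_R$ in equation 25 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the instrument's implementation: the pilot is replaced by a random sequence of $\pm 1 \pm j$ chips, delays are restricted to whole samples, the frequency error is zero, and all names are hypothetical.

```python
import random

def normalized_correlation(y, s):
    """Normalized squared correlation of equation 25 for one trial delay."""
    p = sum(yk * sk.conjugate() for yk, sk in zip(y, s))
    energy_s = sum(abs(sk) ** 2 for sk in s)
    energy_y = sum(abs(yk) ** 2 for yk in y)
    return abs(p) ** 2 / (energy_s * energy_y)

def estimate_delay(y, pilot, max_delay):
    """Return the trial delay (in samples) that maximizes the metric."""
    n = len(pilot)

    def metric(delay):
        # Reference delayed by `delay` samples (circular shift for simplicity).
        shifted = [pilot[(k - delay) % n] for k in range(len(y))]
        return normalized_correlation(y, shifted)

    return max(range(max_delay + 1), key=metric)

rng = random.Random(1)

def random_chips(count):
    """Random +/-1 +/-j chips; a stand-in for a spread sequence (assumption)."""
    return [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(count)]

pilot = random_chips(256)
other = random_chips(256)      # one additional, weakly correlated code channel
true_delay = 5
y = [0.8 * pilot[(k - true_delay) % 256] + 0.6 * other[k] for k in range(256)]
print(estimate_delay(y, pilot, 20))
```

Because the second channel is only weakly correlated with the pilot, the metric still peaks at the true delay even though the pilot carries only part of the signal energy.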

Maximizing equation 25 corresponds to maximizing the correlation between the observable baseband signal and an ideal reference signal for the pilot only. Usually, the observable baseband signal will consist of the superposition of a number of code channels. However, since the correlation between the pilot and the other code channels is small, the maximization of equation 25 provides a good initial estimate of $\tau_0$.

$P(\tau_R, 0)$ is sensitive to frequency error $\Delta\omega$, which limits the range of $\Delta\omega$ for which equation 25 can be used. We can obtain an expression for the frequency response of $P(\tau_0, 0)$ by setting

$$Y(t) = S(t-\tau_0, \Delta\omega) \tag{26}$$

to obtain

$$P(\tau_0, 0) = \sum_k A_0^2(t_k-\tau_0)\,e^{j\Delta\omega t_k}. \tag{27}$$

To simplify the evaluation of this expression, consider sampling at points for which the signal trajectory passes through the constellation points of the pilot, so that $A_0(t_k-\tau_0)$ is constant. In this case, the magnitude of $P(\tau_0, 0)$ is:



Fig. 8. Complex envelope of the baseband signal.






$$|P(\tau_0, 0)| = \left|\frac{\sin\left(\frac{T}{2}\Delta\omega\right)}{\sin\left(\frac{T}{2K}\Delta\omega\right)}\right| \tag{28}$$

where $T$ is the length of the data record used to calculate $P(\tau_0, 0)$ and $K$ is the number of samples in the data record.

From the sketch of $P(\tau_0, 0)$ in Fig. 9, we see that $P(\tau_0, 0) = 0$ for $\Delta\omega = 2\pi/T$. In devising the search strategy for finding $\tau_0$, it was assumed that frequency errors would be less than $\pm\pi/T$. Therefore, reliable estimates of $\tau_0$ can be obtained only if

$$|\Delta\omega| < \frac{\pi}{T}. \tag{29}$$



After the value of $\tau_0$ is determined, we obtain an estimate, $\Delta\hat{\omega}$, of $\Delta\omega$ from the discriminator formed as the ratio of the difference over the sum of $|P(\hat{\tau}_0, \Delta\omega_0)|$ and $|P(\hat{\tau}_0, -\Delta\omega_0)|$:

$$\Delta\hat{\omega} = \frac{\pi}{T}\cdot\frac{|P(\hat{\tau}_0, \Delta\omega_0)| - |P(\hat{\tau}_0, -\Delta\omega_0)|}{|P(\hat{\tau}_0, \Delta\omega_0)| + |P(\hat{\tau}_0, -\Delta\omega_0)|}, \tag{30}$$

where $\Delta\omega_0 = \pi/T$. The formation of this discriminator is illustrated in Fig. 10, where $|P(\hat{\tau}_0, \Delta\omega_0)|$ is shown by the upper dashed curve, $-|P(\hat{\tau}_0, -\Delta\omega_0)|$ is shown by the lower dashed curve, and the discriminator curve, $\Delta\hat{\omega}T/\pi$, is shown by the solid curve.

The function given by equation 30 is a linear function of $\Delta\omega$ for $|\Delta\omega| < \pi/T$ and provides a reasonably good initial estimate of the frequency error when a significant percentage (on the order of 10% or more) of the total transmitter power is contained in the pilot channel.
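The behavior of the discriminator of equation 30 can be illustrated with a short Python sketch. It assumes an idealized, time-aligned pilot of unit amplitude sampled at its constellation points (so the reference reduces to a complex exponential); the sample count, the 100-Hz error, and the function names are assumptions made for the illustration.

```python
import cmath
import math

def correlation(y, t, omega_r):
    """P(tau_0, omega_R) for a time-aligned, unit-amplitude pilot reference."""
    return sum(yk * cmath.exp(-1j * omega_r * tk) for yk, tk in zip(y, t))

def estimate_frequency_error(y, t, record_length):
    """Discriminator of equation 30; returns the estimate in rad/s."""
    omega_0 = math.pi / record_length
    upper = abs(correlation(y, t, omega_0))
    lower = abs(correlation(y, t, -omega_0))
    return omega_0 * (upper - lower) / (upper + lower)

T = 1.25e-3                      # record length in seconds
K = 1536                         # samples in the record (assumed)
t = [k * T / K for k in range(K)]
true_error_hz = 100.0            # assumed frequency error, inside +/-400 Hz
y = [cmath.exp(1j * (2 * math.pi * true_error_hz * tk + 0.3)) for tk in t]

estimate_hz = estimate_frequency_error(y, t, T) / (2 * math.pi)
print(round(estimate_hz, 3))
```

For this idealized input the ratio works out to $\tan(\Delta\omega T/2K)/\tan(\pi/2K)$, which is very nearly $\Delta\omega T/\pi$ when $K$ is large, so the estimate closely tracks the true error inside the range $|\Delta\omega| < \pi/T$.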

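An estimate of the transmitter phase is obtained from the phase of the correlation function with $\tau_R = \hat{\tau}_0$ and $\omega_R = \Delta\hat{\omega}$:

$$\hat{\theta}_0 = \tan^{-1}\frac{\Im\{P(\hat{\tau}_0, \Delta\hat{\omega})\}}{\Re\{P(\hat{\tau}_0, \Delta\hat{\omega})\}} \tag{31}$$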






Fig. 10. Formation of the discriminator of equation 30. $|P(\hat{\tau}_0, \Delta\omega_0)|$ is shown by the upper dashed curve, $-|P(\hat{\tau}_0, -\Delta\omega_0)|$ is shown by the lower dashed curve, and the discriminator curve, $\Delta\hat{\omega}T/\pi$, is shown by the solid curve.

where $\Re\{z\}$ and $\Im\{z\}$ are the real and imaginary parts of $z$, respectively.

Because of the weak correlation between the pilot channel and the other code channels, equations 25, 30, and 31 provide good initial estimates of $\tau_0$, $\Delta\omega$, and $\theta_0$. The estimates of these parameters are refined after the intersymbol interference has been removed by the complementary filter discussed later in this article. Further refinement of these parameters is achieved when estimating time and phase offsets of the code channels relative to the pilot channel. The estimation of the offset parameters is discussed later in this article.

Code-Domain Power Spectrum 

The code-domain power spectrum is given in terms of the coefficients $\rho_i$, where $\rho_i$ is defined as the fractional part of the transmitter power contained in the ith code channel. The first step in calculating the code-domain power spectrum is to multiply $I(t_k)$ and $Q(t_k)$ by $i_{pn}$ and $q_{pn}$. The results of these calculations are shown in Table VI.
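The despreading step amounts to an element-by-element multiplication. The Python sketch below applies it to the sample and chip values of the article's example (Table VI); the function name is an illustrative assumption.

```python
# Composite baseband samples I(t_k), Q(t_k) and PN chips from the example.
I = [-1.4, 0.2, -1.4, 0.2, -0.2, 1.4, 0.2, -1.4]
Q = [-1.4, 0.2, 1.4, -0.2, -0.2, -1.4, -0.2, 1.4]
i_pn = [-1, 1, -1, 1, -1, 1, 1, -1]
q_pn = [-1, 1, 1, -1, -1, -1, -1, 1]

def despread(i_samples, q_samples, i_chips, q_chips):
    """Multiply I by i_pn and Q by q_pn, returning complex Z = Z_I + jZ_Q."""
    return [complex(i * ci, q * cq)
            for i, q, ci, cq in zip(i_samples, q_samples, i_chips, q_chips)]

Z = despread(I, Q, i_pn, q_pn)
print(Z)
```

After despreading, every sample lies in the first quadrant, which is what allows the Walsh functions alone to separate the code channels in the next step.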



Fig. 9. The correlation function $P(\tau_0, 0)$ as a function of frequency error.







Table VI
Despreading of I and Q

                     1st Walsh function interval     2nd Walsh function interval
Time                  t1     t2     t3     t4         t5     t6     t7     t8
I(t_k)               -1.4    0.2   -1.4    0.2       -0.2    1.4    0.2   -1.4
Q(t_k)               -1.4    0.2    1.4   -0.2       -0.2   -1.4   -0.2    1.4
i_pn                 -1      1     -1      1         -1      1      1     -1
q_pn                 -1      1      1     -1         -1     -1     -1      1
Z_I = I i_pn          1.4    0.2    1.4    0.2        0.2    1.4    0.2    1.4
Z_Q = Q q_pn          1.4    0.2    1.4    0.2        0.2    1.4    0.2    1.4






The code-domain power spectrum is:

$$\rho_i = \frac{\sum_{h=1}^{N}\left|\sum_{k=1}^{M} Z_{hk}R^*_{ik}\right|^2}{\sum_{k=1}^{M}|R_{ik}|^2 \sum_{h=1}^{N}\sum_{k=1}^{M}|Z_{hk}|^2} \tag{32}$$

where $Z_{hk}$ is the kth sample of the despread signal in the hth Walsh function interval, $R_{ik}$ is the kth chip of the ith Walsh function, $M$ is the number of chips in a Walsh function, and $N$ is the number of Walsh function intervals in the measurement interval. The calculations of $\rho_i$, i = 0, 1, 2, 3 for the above example are presented in Table VII ($j = \sqrt{-1}$).

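Table VII
Calculation of $\rho_i$ for the Example

 h   k   Z_hk          R_0k     Z_hk R*_0k    R_1k      Z_hk R*_1k
 1   1   1.4 + j1.4    1 + j       2.8        1 + j        2.8
 1   2   0.2 + j0.2    1 + j       0.4       -1 - j       -0.4
 1   3   1.4 + j1.4    1 + j       2.8        1 + j        2.8
 1   4   0.2 + j0.2    1 + j       0.4       -1 - j       -0.4
 2   1   0.2 + j0.2    1 + j       0.4        1 + j        0.4
 2   2   1.4 + j1.4    1 + j       2.8       -1 - j       -2.8
 2   3   0.2 + j0.2    1 + j       0.4        1 + j        0.4
 2   4   1.4 + j1.4    1 + j       2.8       -1 - j       -2.8

$$\sum_{k=1}^{4}|R_{ik}|^2 = 8 \tag{33}$$

$$\sum_{h=1}^{2}\sum_{k=1}^{4}|Z_{hk}|^2 = 8(1.4^2) + 8(0.2^2) = 16 \tag{34}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{0k}\right|^2 = 6.4^2 + 6.4^2 = 81.92 \tag{35}$$

$$\rho_0 = \frac{81.92}{8(16)} = \frac{81.92}{128} = 0.64 \tag{36}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{1k}\right|^2 = 4.8^2 + 4.8^2 = 46.08 \tag{37}$$

$$\rho_1 = \frac{46.08}{128} = 0.36 \tag{38}$$

$$\rho_2 = 0$$

$$\rho_3 = 0$$

$$\rho_0 + \rho_1 + \rho_2 + \rho_3 = 1.0000. \tag{39}$$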



Since we selected signal amplitudes $a_0 = 0.8$ and $a_1 = 0.6$, the total signal energy in our measurement interval (two Walsh function intervals) is proportional to $(0.8^2 + 0.6^2) = 1.0$ and the fractions of signal energy in the pilot and code channel 1, respectively, are $0.8^2 = 0.64$ and $0.6^2 = 0.36$. We see, therefore, that the results of this example verify that $\rho_i$ is the fractional part of the energy of the observed signal that is contained in the ith code channel.
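The calculation of equation 32 for this example can be reproduced with a short Python sketch. The function and variable names are illustrative assumptions; the Walsh functions and the despread samples are those of the example.

```python
WALSH = [            # order-4 Walsh functions used in the example
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]

def code_domain_power(z, walsh=WALSH):
    """Evaluate equation 32 for despread samples z (length N*M)."""
    m = len(walsh[0])
    intervals = [z[h:h + m] for h in range(0, len(z), m)]
    total_energy = sum(abs(zk) ** 2 for zk in z)
    rho = []
    for row in walsh:
        r = [complex(w, w) for w in row]              # R_ik = w_k (1 + j)
        r_energy = sum(abs(rk) ** 2 for rk in r)
        projected = sum(
            abs(sum(zk * rk.conjugate() for zk, rk in zip(block, r))) ** 2
            for block in intervals
        )
        rho.append(projected / (r_energy * total_energy))
    return rho

# Despread samples Z_hk from the example (two Walsh function intervals).
Z = [complex(v, v) for v in (1.4, 0.2, 1.4, 0.2, 0.2, 1.4, 0.2, 1.4)]
print([round(p, 4) for p in code_domain_power(Z)])
```

The four coefficients come out as 0.64, 0.36, 0, and 0 to within floating-point rounding, matching the amplitudes $a_0 = 0.8$ and $a_1 = 0.6$ chosen for the example.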



Errors 

Various errors will produce a transmitter signal that does not match the ideal reference signal. These errors will manifest themselves as a distribution of the transmitter signal energy among the code channels that varies from the ideal distribution. As mentioned earlier, the transmitter signal may have an unknown time reference and carrier frequency. However, as we saw, these parameters are estimated so that they can be removed from the signal to be measured. Therefore, frequency errors and time delay are compensated to a sufficient degree of accuracy to have minimal influence on the distribution of code-domain power.

Other types of errors are not compensated. These include signal impairments caused by nonideal components in the transmitter such as nonideal filters, nonlinearities, gain and phase imbalances, mixer spurs, quantization errors, and others.

Waveform Quality Factor ($\rho$). A measure of the quality of the transmitter signal is obtained by measuring $\rho$, defined as:

$$\rho = \frac{\left|\sum_k Z_k R^*_{0k}\right|^2}{\sum_k |R_{0k}|^2 \sum_k |Z_k|^2} \tag{40}$$

where $Z_k$ is the kth sample of the despread signal, $R^*_{0k} = 1 - j$, and only the pilot is transmitted. By comparing equations 40 and 32, we see that $\rho$ and $\rho_0$ are similar but not identical. When $\rho_0$ is calculated, the energy in code channel 0 is found for each Walsh function interval in the measurement interval and the sum of these energies is obtained. When $\rho$ is calculated, the energy of the projection onto $R^*_{0k} = 1 - j$ over the entire measurement interval is obtained. For random type errors, values obtained for $\rho$ and $\rho_0$ will be essentially equal. However, certain types of errors such as uncompensated frequency errors will yield different values for $\rho$ and $\rho_0$.

According to equations 32 and 40, a fixed phase difference between the measured baseband signal and the reference signal will not affect $\rho$ and $\rho_i$. This is true because these functions involve the calculation of energies that are insensitive to phase, that is, $|e^{j\theta}z|^2 = |z|^2$ for any fixed phase $\theta$.
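The phase insensitivity of the waveform quality factor, and its sensitivity to an uncompensated frequency error, can be checked with a small Python sketch of equation 40. The pilot amplitude of 0.8 follows the example; the 0.4-radian phase error and the 0.05-radian-per-sample phase drift are assumed values chosen only for illustration.

```python
import cmath

def waveform_quality(z):
    """Evaluate equation 40 with the pilot reference R_0k = 1 + j."""
    r_conj = complex(1, -1)
    numerator = abs(sum(zk * r_conj for zk in z)) ** 2
    denominator = (2 * len(z)) * sum(abs(zk) ** 2 for zk in z)   # |R_0k|^2 = 2
    return numerator / denominator

pilot_only = [complex(0.8, 0.8)] * 8                     # error-free despread pilot
rotated = [zk * cmath.exp(0.4j) for zk in pilot_only]    # fixed phase error
drifting = [zk * cmath.exp(0.05j * k)                    # residual frequency error
            for k, zk in enumerate(pilot_only)]

print(round(waveform_quality(pilot_only), 6),
      round(waveform_quality(rotated), 6),
      round(waveform_quality(drifting), 6))
```

A fixed rotation leaves the result at unity, while a phase that advances from sample to sample reduces it, because the projection is taken over the entire measurement interval.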

Time and Phase Offset Errors. Time offsets and phase offsets of the code channels relative to the pilot channel are errors with tolerances specified in IS-97. Offset errors in a particular code channel will cause energy from that code channel to leak into other code channels and thereby cause a change in the distribution of code-domain power. An example of time and phase offset errors is considered in this section.

Suppose there are time and phase offsets of channel 1 with respect to channel 0 of $\Delta\tau_1$ and $\Delta\theta_1$, respectively. For illustrative purposes, we will assume that the pulse response of the transmit filter is triangular, as shown in Fig. 11, so the transmit filter is considered a linear interpolator of adjacent input values. We will extend our example by considering the effects of offsets of $\Delta\tau_1 = 0.1T_c$, where $T_c$ is the chip interval, and $\Delta\theta_1 = 0.1$ radian. We compute $I_1$ and $Q_1$ for this case as presented in Table VIII.
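The two steps used to generate the offset version of code channel 1 (linear interpolation for the time offset, then a rotation for the phase offset) can be sketched in Python as follows. The function name is hypothetical, and the value of the chip following $t_8$ is not listed in the article; $-0.6$ is inferred here from the tabulated results and should be read as an assumption.

```python
import math

def apply_offsets(i_chips, q_chips, i_next, q_next, delay_fraction, phase):
    """Apply a fractional-chip delay (linear interpolation) and a phase rotation."""
    i_ext = i_chips + [i_next]
    q_ext = q_chips + [q_next]
    out = []
    for k in range(len(i_chips)):
        # Triangular pulse response: blend the current and the following chip.
        i_d = (1 - delay_fraction) * i_ext[k] + delay_fraction * i_ext[k + 1]
        q_d = (1 - delay_fraction) * q_ext[k] + delay_fraction * q_ext[k + 1]
        out.append((i_d * math.cos(phase) - q_d * math.sin(phase),
                    i_d * math.sin(phase) + q_d * math.cos(phase)))
    return out

# Error-free code channel 1 chips (amplitude 0.6) from the example.
I1 = [-0.6, -0.6, -0.6, -0.6, 0.6, 0.6, -0.6, -0.6]
Q1 = [-0.6, -0.6, 0.6, 0.6, 0.6, -0.6, 0.6, 0.6]
# Chip after t8 is inferred (assumption): -0.6 for both I and Q.
shifted = apply_offsets(I1, Q1, -0.6, -0.6, 0.1, 0.1)
print([(round(i, 4), round(q, 4)) for i, q in shifted])
```

Adding these values to the pilot chips ($\pm 0.8$) gives the composite I and Q samples that are then despread.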






Table VIII
Calculation of $\rho_i$ for the Example with Time and Phase Offsets

From timing error (linearly interpolate: 0.9 x current value, 0.1 x future value)

Time      t1        t2        t3        t4        t5        t6        t7        t8
I1       -0.6      -0.6      -0.6      -0.48      0.6       0.48     -0.6      -0.6
Q1       -0.6      -0.48      0.6       0.6       0.48     -0.48      0.6       0.48

From phase error (I1 cos 0.1 - Q1 sin 0.1, I1 sin 0.1 + Q1 cos 0.1)

I1       -0.5371   -0.5491   -0.6569   -0.5375    0.5491    0.5255   -0.6569   -0.6449
Q1       -0.6569   -0.5375    0.5371    0.5491    0.5375   -0.4297    0.5371    0.4177
I0       -0.8       0.8      -0.8       0.8      -0.8       0.8       0.8      -0.8
Q0       -0.8       0.8       0.8      -0.8      -0.8      -0.8      -0.8       0.8
I        -1.3371    0.2509   -1.4569    0.2625   -0.2509    1.3255    0.1431   -1.4449
Q        -1.4569    0.2625    1.3371   -0.2509   -0.2625   -1.2297   -0.2629    1.2177

Multiply by i_pn and q_pn to obtain Z = Z_I + jZ_Q

Z_I       1.3371    0.2509    1.4569    0.2625    0.2509    1.3255    0.1431    1.4449
Z_Q       1.4569    0.2625    1.3371    0.2509    0.2625    1.2297    0.2629    1.2177

 h   k   Z_hk                R_0k     Z_hk R*_0k           R_1k      Z_hk R*_1k
 1   1   1.3371 + j1.4569    1 + j    2.7940 + j0.1198     1 + j     2.7940 + j0.1198
 1   2   0.2509 + j0.2625    1 + j    0.5134 + j0.0116    -1 - j    -0.5134 - j0.0116
 1   3   1.4569 + j1.3371    1 + j    2.7940 - j0.1198     1 + j     2.7940 - j0.1198
 1   4   0.2625 + j0.2509    1 + j    0.5134 - j0.0116    -1 - j    -0.5134 + j0.0116
 2   1   0.2509 + j0.2625    1 + j    0.5134 + j0.0116     1 + j     0.5134 + j0.0116
 2   2   1.3255 + j1.2297    1 + j    2.5552 - j0.0958    -1 - j    -2.5552 + j0.0958
 2   3   0.1431 + j0.2629    1 + j    0.4060 + j0.1198     1 + j     0.4060 + j0.1198
 2   4   1.4449 + j1.2177    1 + j    2.6626 - j0.2272    -1 - j    -2.6626 + j0.2272

 h   k   R_2k      Z_hk R*_2k            R_3k      Z_hk R*_3k
 1   1    1 + j     2.7940 + j0.1198      1 + j     2.7940 + j0.1198
 1   2    1 + j     0.5134 + j0.0116     -1 - j    -0.5134 - j0.0116
 1   3   -1 - j    -2.7940 + j0.1198     -1 - j    -2.7940 + j0.1198
 1   4   -1 - j    -0.5134 + j0.0116      1 + j     0.5134 - j0.0116
 2   1    1 + j     0.5134 + j0.0116      1 + j     0.5134 + j0.0116
 2   2    1 + j     2.5552 - j0.0958     -1 - j    -2.5552 + j0.0958
 2   3   -1 - j    -0.4060 - j0.1198     -1 - j    -0.4060 - j0.1198
 2   4   -1 - j    -2.6626 + j0.2272      1 + j     2.6626 - j0.2272






Fig. 11. Simplified impulse response of the transmit filter ($T_c$ = chip interval).

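From the values obtained in Table VIII, we compute the code-domain power coefficients as follows:

$$\sum_{k=1}^{4}|R_{ik}|^2 \sum_{h=1}^{2}\sum_{k=1}^{4}|Z_{hk}|^2 = 121.1648 \tag{41}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{0k}\right|^2 = |6.6148|^2 + |6.1372 - j0.1916|^2 = 81.4575 \tag{42}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{1k}\right|^2 = |4.5612|^2 + |-4.2984 + j0.4544|^2 = 39.4873 \tag{43}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{2k}\right|^2 = |j0.2628|^2 + |j0.0232|^2 = 0.0696 \tag{44}$$

$$\sum_{h=1}^{2}\left|\sum_{k=1}^{4} Z_{hk}R^*_{3k}\right|^2 = |j0.2164|^2 + |0.2148 - j0.2396|^2 = 0.1504 \tag{45}$$

$$\rho_0 = \frac{81.4575}{121.1648} = 0.6723 \tag{46}$$

$$\rho_1 = \frac{39.4873}{121.1648} = 0.3259 \tag{47}$$

$$\rho_2 = \frac{0.0696}{121.1648} = 0.0006 \tag{48}$$

$$\rho_3 = \frac{0.1504}{121.1648} = 0.0012 \tag{49}$$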



We note that the timing and phase errors caused some of the energy from code channel 1 to leak into the other code channels. However, the coefficients again sum to unity (equation 50). This condition is always satisfied regardless of the errors introduced to the data sequence $Z = Z_I + jZ_Q$.

Estimates of Time and Phase Offsets. We saw in the above example that when code channel 1 was offset in time and phase relative to the pilot channel, errors were introduced that caused the relative energy to increase in code channels 0, 2, and 3 and to decrease in channel 1. To determine the values of the offset errors, the mean squared difference between the observable data, $Z$, and an ideal reference signal, $R$, is minimized. For the example considered above, the errors introduced by timing and phase offsets are equal to the difference in $Z_I + jZ_Q$ for the case of no errors given in Table VI and the case with phase and time offset errors given in Table VIII. These errors as a function of time $t_k$ are listed in Table IX.

Using the listed values, the mean squared error is:

$$\mathrm{MSE} = \frac{1}{8}\sum_{k=1}^{8}\left|E_{Ik} + jE_{Qk}\right|^2 \tag{51}$$



= 0J3876. 

To estimate timing and phase offset errors, the active code channels are determined by calculating $\rho_i$ for every i and identifying the channels for which the values of $\rho_i$ are above a preset threshold. For example, if a threshold of 0.01 (corresponding to -20 dB) is used, every channel for which $\rho_i > 0.01$ will be declared an active channel.

In addition to determining the active code channels, it is necessary to determine the data sequence $d_{ih}$ for each active channel, in which the subscript i denotes the ith code channel and the subscript h denotes the hth Walsh function interval in the measurement interval. The data detector incorporated into the function used to calculate $\rho_i$ is:

$$d_{ih} = \mathrm{sgn}\left(\Re\left\{\sum_k Z_{hk}R^*_{ik}\right\}\right) \tag{52}$$

where

$$\mathrm{sgn}(u) = \begin{cases} 1, & u > 0 \\ -1, & u < 0 \end{cases} \tag{53}$$

and $\Re\{z\}$ is the real part of $z$. The index k varies over the chips in a Walsh function interval (k = 0 to 3 in our example). From the values tabulated in Table VIII, we can generate the detected data as shown in Table X.
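The active-channel test and the data detector of equation 52 can be combined in a short Python sketch. It uses the despread samples of the offset example (Table VIII) and the 0.01 threshold mentioned above; the function names and the returned data layout are illustrative assumptions, and a zero-valued projection is arbitrarily mapped to -1 here.

```python
WALSH = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]

# Despread samples Z_hk of the example with time and phase offsets.
Z = [
    [1.3371 + 1.4569j, 0.2509 + 0.2625j, 1.4569 + 1.3371j, 0.2625 + 0.2509j],  # h = 1
    [0.2509 + 0.2625j, 1.3255 + 1.2297j, 0.1431 + 0.2629j, 1.4449 + 1.2177j],  # h = 2
]

def project(block, walsh_row):
    """Sum over k of Z_hk times the conjugate of R_ik = w_k (1 + j)."""
    return sum(z * complex(w, -w) for z, w in zip(block, walsh_row))

def detect(z_blocks, threshold=0.01):
    """Return {channel: (rho_i, [d_ih, ...])} for channels above the threshold."""
    total = sum(abs(z) ** 2 for block in z_blocks for z in block)
    active = {}
    for i, row in enumerate(WALSH):
        sums = [project(block, row) for block in z_blocks]
        rho = sum(abs(s) ** 2 for s in sums) / (2 * len(row) * total)
        if rho > threshold:
            bits = [1 if s.real > 0 else -1 for s in sums]
            active[i] = (rho, bits)
    return active

result = detect(Z)
print({i: (round(rho, 4), bits) for i, (rho, bits) in result.items()})
```

Only channels 0 and 1 exceed the threshold, and the detected data for channel 1 change sign between the two Walsh function intervals, in agreement with Table X.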



$$\rho_0 + \rho_1 + \rho_2 + \rho_3 = 1.0000 \tag{50}$$






Table IX
In-Phase (E_I) and Quadrature (E_Q) Components of Errors in Example for Timing and Phase Offset Errors

Time      t1        t2        t3        t4        t5        t6        t7        t8
E_I      -0.0629    0.0509    0.0569    0.0625    0.0509   -0.0745   -0.0569    0.0449
E_Q       0.0569    0.0625   -0.0629    0.0509    0.0625   -0.1703    0.0629   -0.1823



Table X
Calculations for Data Detection in the Example

 i,h    $\rho_i$             Sum over k of Z_hk R*_ik    d_ih
 0,1    0.6723 (active)       6.6148                      1
 0,2    0.6723 (active)       6.1372 - j0.1916            1
 1,1    0.3259 (active)       4.5612                      1
 1,2    0.3259 (active)      -4.2984 + j0.4544           -1
 2,1    0.0006 (inactive)
 2,2    0.0006 (inactive)
 3,1    0.0012 (inactive)
 3,2    0.0012 (inactive)







After the active code channels and their data sequences are determined, an ideal signal of the form of equations 9 and 10 can be generated for each active code channel. The in-phase and quadrature components of the ideal signals are:

$$I_i(t) = A_i(t)\cos\Phi_i(t) \tag{54}$$

and

$$Q_i(t) = A_i(t)\sin\Phi_i(t) \tag{55}$$

where $A_i(t)$ and $\Phi_i(t)$ are the amplitude and phase of the ideal signal of the ith code channel passing through the points $(\pm 1, \pm 1)$ in the I-Q diagram as shown in Fig. 6. The reference signal is generated by superimposing the ideal signals given by equations 54 and 55 for each active code channel. The resulting in-phase and quadrature components of the ideal reference signal are:

$$I_{ref}(t) = \sum_i a_i A_i(t-\tau_i)\cos[\Delta\omega t + \Phi_i(t-\tau_i) + \theta_i] \tag{56}$$

and

$$Q_{ref}(t) = \sum_i a_i A_i(t-\tau_i)\sin[\Delta\omega t + \Phi_i(t-\tau_i) + \theta_i] \tag{57}$$

where $\Delta\omega$ is frequency error, $a_i$ is the relative amplitude ($a_i = \sqrt{\rho_i}$), $\tau_i$ is the time delay, and $\theta_i$ is the phase of the ith code channel. The summations are over the set of active code channels.

The frequency error, time delays, and phases are determined by finding values of $\Delta\omega$, $a_i$, $\tau_i$, and $\theta_i$ for all values of i corresponding to the active code channels to minimize the mean squared difference between the observable sequence $Z(t_k) = Z_I(t_k) + jZ_Q(t_k)$ and the reference $R(t_k) = I_{ref}(t_k) + jQ_{ref}(t_k)$, which is:

$$\mathrm{MSE} = \frac{1}{MN}\sum_{k=1}^{MN}\left|Z(t_k) - R(t_k)\right|^2 \tag{58}$$

where $M$ and $N$ are the same as in equation 32. $\hat{\tau}_0$ and $\Delta\hat{\omega}$ are used to update previous estimates of time delay and frequency. Estimates of time and phase offsets obtained from $\hat{\tau}_i$ and $\hat{\theta}_i$ are:

$$\Delta\hat{\tau}_i = \hat{\tau}_i - \hat{\tau}_0 \tag{59}$$

and

$$\Delta\hat{\theta}_i = \hat{\theta}_i - \hat{\theta}_0.$$

For the example above, values of $\Delta\omega$, $a_i$, $\tau_i$, and $\theta_i$ would be found to produce zero mean squared difference and error-free estimates of these parameters. In general, however, errors other than those introduced by timing and phase offsets would be present, so that after the minimization of the mean squared difference, a nonzero residual between the reference and the observable would exist and the parameters would be estimated with some error.

Signal Flow Diagram

The signal flow diagram for the CDMA power, timing, and phase offset measurement algorithms is shown in Fig. 12. The signal under test from the base station transmitter is down-converted to a 3.6864-MHz IF signal that is sampled at 4.9152 MSa/s. The digitized IF signal is passed through a finite-impulse-response (FIR), linear-phase, digital IF filter centered at 1.2288 MHz. This filter has a flat passband 1.4 MHz wide, which is considerably wider than the 1.23-MHz bandwidth of the IF signal and provides blocking at dc and 359.2 kHz. Indeed, the primary purpose of the IF filter is to block these signal components.
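The relationship between the 3.6864-MHz analog IF and the 1.2288-MHz center frequency of the digital IF filter follows from sampling below the Nyquist rate for the IF. A minimal Python sketch of the folding arithmetic (the function name is an assumption) is:

```python
def alias_frequency(f_signal_hz, f_sample_hz):
    """Frequency (Hz) at which a real tone appears after sampling."""
    folded = f_signal_hz % f_sample_hz
    return min(folded, f_sample_hz - folded)

F_IF = 3_686_400        # 3.6864-MHz analog IF
F_S = 4_915_200         # 4.9152-MSa/s sample rate
print(alias_frequency(F_IF, F_S))   # 1228800, that is, 1.2288 MHz
```

The sampled IF therefore appears at one quarter of the sample rate, which is where the digital IF filter is centered.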

Following the IF filter, the signal is down-converted to in-phase (I) and quadrature (Q) baseband signals. In the down-converter, the I and Q signals are filtered by flat, FIR, linear-phase, low-pass filters with passbands from 0 to 700 kHz and stop bands from 1.16 to 2.0 MHz. The full sample rate of 4.9152 MSa/s is retained at the output of the down-converter to provide maximum accuracy at the correlator.

The next function after the down-converter is the correlator, which provides an estimate of the timing of the signal under test. The inputs to the correlator are the baseband signal from the down-converter and an internally generated reference signal. This reference signal is the mathematically ideal signal that would be present at the output of the down-converter if only the pilot signal were transmitted. The time origin of the reference signal corresponds to the first binary 1 following 15 binary 0s of the pseudonoise sequences $i_{pn}$ and $q_{pn}$, as specified in the IS-95 standard.






Fig. 12. Signal flow diagram for the HP 83203B CDMA power, timing, and phase offset measurement algorithms.



The correlator performs the timing acquisition described earlier by finding the value of $\tau_R$ that maximizes the function given by expression 25. Since this function is sensitive to frequency error, the correlator works reliably over a limited range of frequency. If $T$ is the length of the record (in seconds) used in the correlator, then the maximum frequency error for which the correlator will provide reliable acquisition is:

$$\Delta f_{max} = \frac{1}{2T}. \tag{60}$$

For example, if a 1.25-ms time record is used, then the maximum frequency error that will allow reliable acquisition in time is $\pm\Delta f_{max} = \pm 400$ Hz.

After the time delay $\tau_0$ is determined, the baseband signal is time-aligned with the reference signal. This function is performed in the synchronizer, which consists of a pair (for I and Q) of low-pass filters that resample the signals at a rate of 2.4576 MSa/s with a variable time delay to introduce the appropriate timing.

The synchronized baseband and reference signals are used in the frequency and phase preestimator to obtain initial estimates of the carrier frequency and phase as given by equations 30 and 31. These estimates are then used in the frequency and phase compensator to largely remove $\Delta\omega$ and $\theta_0$ from the baseband signals.

After obtaining a baseband signal that is compensated in frequency and phase, the next step is to remove the intersymbol interference introduced by the transmit filter. This step is necessary to ensure the orthogonality of the code channels to allow calculation of the code-domain power coefficients by the algorithm discussed earlier. Intersymbol interference is removed by the complementary filter, which when cascaded with the transmit filter produces an overall filter response that satisfies Nyquist's criterion for zero intersymbol interference.

After the intersymbol interference is removed from the baseband signal by the complementary filter, refined estimates of the carrier frequency and phase are obtained by minimizing the mean squared difference between the baseband signal and a reference signal consisting of only the pilot. The procedure used here is similar to that used for estimating the frequency and phase in conjunction with the time and phase offsets as described earlier. After the intersymbol interference has been removed, it is unnecessary to include the effect of the transmit filters; this allows the pilot sequences to be used directly as the reference signals.

After the refined estimates of carrier frequency and phase are obtained, the baseband signal is again passed through a compensator and a complementary filter to improve the removal of frequency error, phase error, and intersymbol interference from the baseband signal.

Following this second stage of compensation, the baseband signal is ready to be used for calculating $\rho_i$ as described earlier. This function is performed in the $\rho_i$ calculator shown in the signal flow diagram. The data bits needed to calculate the reference signal used for estimating time and phase offsets of code channels, as described earlier, are also detected in this function. This function could also be used to calculate the waveform quality factor $\rho$. However, this parameter is actually calculated by another function developed for the HP 83203A using the procedure given in an earlier section.

The final steps in the signal flow diagram involve determining the time offsets and phase offsets of the active code channels relative to the pilot channel. To estimate these offset parameters, it is necessary to generate an ideal reference signal corresponding to the active code channels in which the amplitudes, phases, time delays, and frequencies of all of the code channels in the reference signal can be controlled. The function that generates this ideal reference signal, referred to as the reference signal synthesizer, is invoked by the parameter estimator, which uses a search procedure to minimize the mean squared difference between the baseband test signal and the synthesized reference signal as described earlier.

Accuracy of the Measurement Equipment 

Specifications for the HP 83203B (HP 8921A/600) are warranted performance. These specifications are derived from the accuracy of the measurement algorithms, environmental considerations, measurement uncertainties, unit-to-unit variations, and customer specification margins. Typical performance of the HP 83203B is significantly better than the published specifications.

The minimum performance of a base station transmitter is specified in the IS-97 standard. In section 11.1.3 of this standard, Table 11.1.3-1, reproduced here as Table XI, specifies the frequency tolerance, time reference, pilot waveform quality, and RF power output variation.



Table XI
Environmental Test Limits
(from Table 11.1.3-1 in IS-97 Standard)

Parameter                     Limit
Frequency Tolerance           ±0.05 ppm
Time Reference                ±10 µs
Pilot Waveform Quality        $\rho$ > 0.912
RF Power Output Variation     +2 dB, -4 dB



The carrier frequency of the RF signal to be tested is approximately 900 MHz, so the frequency tolerance given above corresponds to an absolute frequency tolerance of ±45 Hz. Since the HP 83203B can acquire a signal and accurately estimate the frequency error when the frequency error is as large as ±400 Hz for a 1.25-ms measurement interval, frequency errors within the above tolerance are easily accommodated.
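The two figures compared in this paragraph follow from simple arithmetic, sketched below in Python. The function names are illustrative assumptions; the inputs are the 1.25-ms record length, the approximate 900-MHz carrier, and the 0.05-ppm tolerance quoted in the text.

```python
def max_acquisition_error_hz(record_seconds):
    """Equation 60: largest frequency error with reliable timing acquisition."""
    return 1.0 / (2.0 * record_seconds)

def tolerance_hz(carrier_hz, tolerance_ppm):
    """Absolute frequency tolerance implied by a ppm limit."""
    return carrier_hz * tolerance_ppm * 1e-6

acquisition_range = max_acquisition_error_hz(1.25e-3)   # 1.25-ms record
allowed_error = tolerance_hz(900e6, 0.05)               # 900-MHz carrier, 0.05 ppm
print(acquisition_range, allowed_error, allowed_error < acquisition_range)
```

The permitted transmitter error (about 45 Hz) is roughly a ninth of the acquisition range (about 400 Hz), which is why errors within the tolerance are easily accommodated.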

The tolerance on pilot waveform quality significantly impacts the accuracy of the measurement algorithms. Error-vector-magnitude-squared ($evm^2$), which is defined as the ratio of the energy of the error to the energy of the error-free transmit signal, can be shown to be approximately related to the waveform quality factor, $\rho$, as:

$$evm^2 \approx \frac{1}{\rho} - 1. \tag{61}$$

For the value of $\rho = 0.912$ in Table XI,

$$evm \approx \sqrt{\frac{1}{0.912} - 1} \approx 0.31, \tag{62}$$

that is, the waveform quality specified in Table XI corresponds to a signal with an rms error of approximately 31%.
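The conversion from waveform quality to rms error in equations 61 and 62 is easy to reproduce. The Python sketch below (the function name is an assumption) evaluates it for the limit given in Table XI.

```python
import math

def evm_from_rho(rho):
    """Approximate rms error vector magnitude from waveform quality (equation 61)."""
    return math.sqrt(1.0 / rho - 1.0)

print(round(evm_from_rho(0.912), 3))
```

The same function shows how quickly the implied error shrinks as $\rho$ approaches unity; for example, $\rho = 0.99$ corresponds to an rms error of about 10%.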

Other errors that impact the accuracy of the measurement equipment are time errors and phase differences between the pilot channel and other code channels. Tolerances on these errors are given in sections 10.3.1.2.3 and 10.3.1.3.3 of the IS-97 standard as less than ±50 ns for time errors and less than ±50 mrad for the phase differences.

The accuracy of the waveform quality measurement equipment is specified in Table 12.4.2.1-1 of the IS-97 standard, repeated here as Table XII.

Waveform quality is measured when only the pilot is transmitted. We will discuss the accuracy in measuring each of the parameters listed above and the measurement interval necessary to achieve the performance specified.

To measure code-domain power, test models for the base station are specified in Table 12.5.2-1 of the IS-97 standard, reproduced here as Table XIII.






Table XII
Accuracy of Waveform Quality Measurement Equipment
(from Table 12.4.2.1-1 in the IS-97 Standard)

Parameter                                  Symbol        Accuracy Requirement
Waveform Quality                           $\rho$        ±5 x 10^-4 from 0.9 to 1.0
Frequency Error (exclusive of test         $\Delta f$    ±10 Hz
  equipment time-base errors)
Pilot Time Alignment                       $\tau_0$      ±135 ns



Table XIII
Base Station Test Model, Nominal
(from Table 12.5.2-1 in the IS-97 Standard)

Type       Number of    Fraction of       Fraction of     Comments
           Channels     Power (linear)    Power (dB)
Pilot      1            0.2000            -7.0            Code channel 0
Sync       1            0.0471            -13.3           Code channel 32, always 1/8 rate
Paging     1            0.1882            -7.3            Code channel 1, full-rate only
Traffic    6            0.09412           -10.3           Variable code channel assignments; full-rate only


The measurement algorithms have been tested and found to provide accurate results for signals with less than 10% of the power in the pilot channel; however, in discussing the accuracy of the measurement algorithms in the next subsection, we will only consider performance under the conditions prescribed by the nominal test model.

The accuracy required of the code-domain measurement equipment is given in Table 12.4.2.2-1 of the IS-97 standard using the nominal test model given above. This table is reproduced here as Table XIV.

We will discuss the accuracy of measuring each of the parameters given in Table XIV and give the minimum measurement intervals and number of subestimates that must be averaged to achieve the accuracies specified.

Accuracy of the Measurement Algorithms
Dynamic Range. The flatness of the filters and the numerical accuracy of the computations used in all of the signal processing algorithms for the HP 83203B are closely maintained to produce a computational error level of approximately -55 dB. Since this error level is typically less than the level of the spurious signals and quantization noise introduced by the analog down-conversion process and the analog-to-digital converter (ADC) used to digitize the IF signal under test, the dynamic range of the HP 83203B is limited by the noise and spurious signal level at the output of the ADC. The ADC uses autoranging to maintain the signal level at the input of the quantizer at -1 dB to -10 dB from saturation. With the ADC operating at 10 dB below saturation, the noise and spurious signal level at the output of the ADC is approximately -45 dB relative to the digitized IF signal. Therefore, the analog and ADC hardware places a limit on the dynamic range of the code-domain power measurements of approximately 45 dB.

Table XIV

Accuracy of Code-Domain Measurement Equipment
(from Table 12.4.2.2-1 in the IS-97 Standard)

Parameter                              Symbol   Accuracy Requirement

Code-domain power coefficients         ρ_i      ±5 × 10^-4 from 5 × 10^-4 to 1.0

Frequency Error                        Δf       ±10 Hz
(exclusive of test equipment
time-base errors)

Code-domain time offset                Δτ_i     ±10 ns
relative to pilot

Code-domain phase offset               Δθ_i     ±0.01 radian
relative to pilot

Accuracy in Measuring ρ and ρ_i. The accuracy in measuring waveform quality ρ and code-domain power ρ_i depends on the accuracy of estimating time delay τ_0 and frequency error Δω. The errors in the measurement of ρ produced by errors in estimating τ_0 and Δω are shown in Figs. 13a and 13b for measurement intervals of 1.04 ms and 2.08 ms. The error curves correspond to transmitting an ideal pilot channel for which the true value of ρ is 1.0. Since the percentage error in the measurement of ρ caused by frequency and timing errors is independent of the true value of ρ, the error curves presented here apply to values of ρ from ρ = 1.0 to ρ < 0.1. From Table XII, we see that the required measurement accuracy specified in the IS-97 standard is ±5 × 10^-4 for ρ = 0.9 to 1.0. This tolerance corresponds to a measurement error of -33 dB for ρ = 1.0, which is shown in Figs. 13a and 13b.

According to Table XII, frequency error must be measured to an accuracy of ±10 Hz and pilot time alignment must be measured to an accuracy of ±135 ns. The uncertainty in the time reference of the ADC and errors of the time-delay estimator contribute to the measurement errors of pilot time delay. In the HP 83203B, the ADC will contribute less than ±125 ns error and the time-delay estimator will contribute less than ±10 ns error to the pilot time alignment measurement. Therefore, for purposes of determining the accuracies in measuring ρ and ρ_i, we can assume that limits on the errors of the measurements of τ_0 and Δω are:

$$-10\ \text{ns} < \hat{\tau}_0 - \tau_0 < 10\ \text{ns}$$

and

$$-10\ \text{Hz} < \Delta\hat{\omega} - \Delta\omega < 10\ \text{Hz}. \qquad (63)$$



From the error curves in Fig. 13, we see that if the tolerances given by equation 63 are achieved, then for a measurement interval of 1.04 ms, the accuracy requirement for measuring ρ is achieved. If a measurement interval of 2.08 ms is used, then a timing error of < 10 ns is satisfactory. However, for the longer measurement interval it is necessary to reduce the tolerance of the frequency error to < 6 Hz. We can effectively get a longer measurement interval and avoid the tighter tolerance on frequency error by averaging several measurements, as considered later.

Fig. 13. Errors in the measurement of signal quality produced by errors in estimating (a) τ_0 and (b) Δω for measurement intervals of 1.04 ms and 2.08 ms. The error curves correspond to transmitting an ideal pilot channel for which the true value of ρ is 1.0 and are valid for ρ = 0.1 to ρ = 1.0. (Horizontal axes: Frequency Error (Hz) and Timing Error (ns).)

The errors caused in the measurement of ρ_0 by errors in estimating τ_0 and Δω are presented in Figs. 14a and 14b. The error curves correspond to transmitting an ideal pilot in which the true value of ρ_0 is 1.0. This is the same as the signal model used for the curves in Fig. 13. We see that the errors caused by timing and frequency errors are relatively insensitive to the measurement interval when measuring code-domain power. The reason for this is the difference in the lengths of the correlators used for the code-domain power and waveform quality calculations. For code-domain power, correlated energies are computed over subintervals one Walsh function interval in length and then 20 of these energy computations are averaged in the case of the 1.04-ms measurement interval, or 40 are averaged in the case of the 2.08-ms measurement interval. For the waveform quality calculation, the correlated energy over the entire measurement interval is computed. Because the length of the correlator used for ρ is a factor of 20 or 40 greater than the length used for ρ_i, the measurement of ρ is much more sensitive to uncompensated frequency errors than the measurement of ρ_i.
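The effect of correlator length can be illustrated with a simplified model that is not taken from this article: if the only impairment is a constant frequency offset, a coherent correlation of length T against an ideal reference retains a fraction sinc²(Δf·T) of the signal energy. The sketch below assumes that model, and also assumes the standard IS-95 chip rate of 1.2288 Mchips/s to obtain the length of one 64-chip Walsh interval. The function names are hypothetical.

```python
import math

CHIP_RATE_HZ = 1.2288e6                 # assumed IS-95 chip rate
WALSH_INTERVAL_S = 64 / CHIP_RATE_HZ    # one 64-chip Walsh interval, about 52 us


def correlated_energy_fraction(freq_error_hz: float, correlator_length_s: float) -> float:
    """Fraction of energy kept by a coherent correlation of the given length
    when the only impairment is a constant frequency offset (sinc^2 model)."""
    x = math.pi * freq_error_hz * correlator_length_s
    if x == 0.0:
        return 1.0
    return (math.sin(x) / x) ** 2


def energy_loss(freq_error_hz: float, correlator_length_s: float) -> float:
    """Fractional energy lost to the frequency offset."""
    return 1.0 - correlated_energy_fraction(freq_error_hz, correlator_length_s)


if __name__ == "__main__":
    cases = [
        ("waveform quality, 1.04 ms", 1.04e-3),
        ("waveform quality, 2.08 ms", 2.08e-3),
        ("code-domain power, one Walsh interval", WALSH_INTERVAL_S),
    ]
    for label, length_s in cases:
        loss = energy_loss(10.0, length_s)
        print(f"{label}: loss for a 10-Hz offset = {loss:.2e}")
```

Under this model a 10-Hz offset stays below the 5 × 10^-4 tolerance for a 1.04-ms correlator, exceeds it for a 2.08-ms correlator, and is negligible over a single Walsh interval, which is in line with the behavior described above.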

From the error curves in Fig. 14, we see that if the tolerances given by equation 63 are achieved, then the accuracy requirement for ρ_0 given in Table XIV is achieved. Again, as with ρ, the percentage error in measuring ρ_0 is independent of the true value of ρ_0.

The curves in Fig. 14 were obtained for ρ_0. However, since all code channel measurements experience essentially the same sensitivities to timing and frequency errors, these curves apply to any ρ_i, i = 0, 1, ..., 63 within the dynamic range of the equipment.

Since the dynamic range of the code-domain power measurement equipment is approximately 45 dB, precise values of code-domain power, well within the tolerances specified by the IS-97 standard, can be obtained for ρ_i = 1.0 to ρ_i = 3.2 × 10^-5 if the tolerances on the estimates of timing and frequency errors are satisfied. To observe code-domain power to a level of -45 dB, it would be necessary to use a test signal with a waveform quality factor of ρ > 0.998, where the errors are uniformly distributed in power over the 64 code channels.

Fig. 14. Errors caused in the measurement of ρ_0 by errors in estimating (a) τ_0 and (b) Δω. The error curves correspond to transmitting an ideal pilot in which the true value of ρ_0 = 1.0 (same signal model as for Fig. 13). The results for ρ_i for i ≠ 0 are essentially the same. (Horizontal axes: Frequency Error (Hz) and Timing Error (ns).)

The measurements of ρ and ρ_i may have error components that are random. Moreover, if a sequence of measurements is made from independent data records, then the random errors for the independent records are uncorrelated. To reduce the random error components added to the measurements of ρ and ρ_i, averaging of a set of measurements obtained from the independent records can be performed. To perform this averaging, it is not appropriate to average the values obtained for ρ and ρ_i directly, since this would introduce a bias to the final result. Rather, the energy terms contained in the numerator and denominator of equation 40 for ρ and equation 32 for ρ_i are averaged separately, and then the final values are obtained as the ratios of these averages. This mode is referred to in the HP 83203B as "Fast Code-Domain Power with Averaging."
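The distinction between the two ways of averaging can be sketched as follows. This is only an illustration: the function names and the energy values are hypothetical, and the numerator and denominator stand in for the energy terms of the ρ and ρ_i expressions rather than reproducing them.

```python
def ratio_of_averages(numerators, denominators):
    """Average the numerator and denominator energy terms separately,
    then form the ratio (the approach described in the text)."""
    if not numerators or len(numerators) != len(denominators):
        raise ValueError("need equally sized, non-empty sequences")
    mean_num = sum(numerators) / len(numerators)
    mean_den = sum(denominators) / len(denominators)
    return mean_num / mean_den


def average_of_ratios(numerators, denominators):
    """Form a ratio per record and average the ratios (the approach
    the text advises against)."""
    if not numerators or len(numerators) != len(denominators):
        raise ValueError("need equally sized, non-empty sequences")
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical energy terms from two independent data records.
    correlated_energy = [0.5, 3.8]
    total_energy = [1.0, 4.0]
    print(ratio_of_averages(correlated_energy, total_energy))
    print(average_of_ratios(correlated_energy, total_energy))
```

The two results differ whenever the records carry different energies, because the ratio of averages weights each record by its energy while the average of ratios weights all records equally.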

Accuracy in Measuring Δτ_i and Δθ_i. The performance of the algorithms for the code-domain parameter estimator was tested by performing simulations in which Gaussian random errors were added to the simulated transmitting signals. A theoretical expression was derived for the standard deviation of the estimates of phase offsets, Δθ_i, based on the same mathematical model used for the simulations. It was found that the results obtained from the simulations agreed very well with the results obtained from the theoretically derived equation, with differences of less than 10 percent. Moreover, it was found that the error in estimating time offsets, Δτ_i, when measured in nanoseconds, was approximately one-half the error in measuring phase offsets measured in milliradians. Since the tolerances on measurement accuracy given in Table XIV are ±10 nanoseconds for time offsets and ±10 milliradians for phase offsets, the measurement interval is governed by the accuracy requirement for phase offsets. To measure time offsets and phase offsets to the accuracy specified in the standard, it was found necessary to average subestimates of these parameters. A noteworthy outcome of the performance analysis discussed herein is that the algorithms designed for the code-domain parameter estimator indeed minimize the sum-square difference between the actual transmit signal and the estimated ideal transmit signal, as specified in the IS-97 standard.

The expression derived for the rms error of the estimate of the phase of a code channel is:

$$\sigma_{\theta} = \frac{1}{2}\sqrt{\frac{\text{evm}^2}{BNT}} \qquad (64)$$

where evm² is the square of the effective error-vector magnitude, which is equal to the ratio of the total energy of the error divided by the energy of the code channel signal in question, B = 615 kHz is the bandwidth of the baseband transmit signal, T is the measurement interval for one subestimate of the phase, and N is the number of subestimates averaged to obtain the estimate of phase.

The worst case occurs for the sync channel, which for the nominal test model given in Table XIII has 4.71% of the total transmit energy. If the waveform quality factor for each active code channel is ρ = 0.912, then the effective evm² for the sync channel is given approximately as:

$$\text{evm}^2 = \frac{1/\rho - 1}{0.0471} = 2.049. \qquad (65)$$



If the measurement interval is T = 2.0 ms (2.2 ms was used in the simulations) and the number of subestimates averaged is 34, then the resulting rms error of the estimate of the phase of the sync channel is:

$$\sigma_{\theta\,\text{sync}} = \frac{1}{2}\sqrt{\frac{2.049}{(615)(34)(2.0)}} = 3.50\ \text{mrad}. \qquad (66)$$

The effective evm² for the pilot channel is:

$$\text{evm}^2 = \frac{1/\rho - 1}{0.2} = 0.4825, \qquad (67)$$

from which, for the same conditions as for the sync channel, we obtain the rms error of the estimate of the phase of the pilot channel as:

$$\sigma_{\theta\,\text{pilot}} = \frac{1}{2}\sqrt{\frac{0.4825}{(615)(34)(2.0)}} = 1.70\ \text{mrad}. \qquad (68)$$



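Since the phase offset of the sync channel is:

$$\Delta\theta_{\text{sync}} = \theta_{\text{sync}} - \theta_{\text{pilot}}, \qquad (69)$$

and θ_sync and θ_pilot are uncorrelated,

$$\sigma_{\Delta\theta\,\text{sync}} = \sqrt{\sigma_{\theta\,\text{sync}}^2 + \sigma_{\theta\,\text{pilot}}^2} = 3.89\ \text{mrad}. \qquad (70)$$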

The estimates of phase are obtained from the sum of 25 subestimates in which the errors in the subestimates are essentially independent. Therefore, the estimate of phase offset is well-approximated as a Gaussian random variable. Using the Gaussian approximation, the 99% confidence interval for the estimate of the phase offset of the sync channel for the nominal test model is:

$$99\%\ \text{confidence interval} = \pm 2.57\,\sigma_{\Delta\theta\,\text{sync}} = \pm 10\ \text{mrad}. \qquad (71)$$
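The chain of calculations in equations 64 through 71 can be checked numerically. The following is a minimal sketch: the function names are illustrative and are not part of the HP 83203B software, and B is expressed in kHz with T in ms so that the product BNT is dimensionless.

```python
import math

B_KHZ = 615.0  # bandwidth of the baseband transmit signal, from equation 64


def effective_evm_squared(rho: float, power_fraction: float) -> float:
    """Equations 65 and 67: error energy relative to the code channel energy."""
    return (1.0 / rho - 1.0) / power_fraction


def rms_phase_error_mrad(evm_squared: float, n_averages: int, t_ms: float) -> float:
    """Equation 64, returned in milliradians."""
    return 1000.0 * 0.5 * math.sqrt(evm_squared / (B_KHZ * n_averages * t_ms))


def rms_phase_offset_error_mrad(sigma_channel: float, sigma_pilot: float) -> float:
    """Equation 70: uncorrelated channel and pilot errors add in quadrature."""
    return math.hypot(sigma_channel, sigma_pilot)


if __name__ == "__main__":
    rho, n_averages, t_ms = 0.912, 34, 2.0
    sigma_sync = rms_phase_error_mrad(effective_evm_squared(rho, 0.0471), n_averages, t_ms)
    sigma_pilot = rms_phase_error_mrad(effective_evm_squared(rho, 0.2), n_averages, t_ms)
    sigma_offset = rms_phase_offset_error_mrad(sigma_sync, sigma_pilot)
    print(f"sync:   {sigma_sync:.2f} mrad")
    print(f"pilot:  {sigma_pilot:.2f} mrad")
    print(f"offset: {sigma_offset:.2f} mrad, 99% interval ±{2.57 * sigma_offset:.1f} mrad")
```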



The measurement accuracy requirement for Δθ_i given in Table XIV is an absolute ±10 milliradians. If we interpret this as the 99% confidence interval, then the accuracy requirement can be achieved by averaging 34 estimates obtained using a 2.0-ms measurement interval as demonstrated by the above example. Other combinations of N and T can be used to achieve the required accuracy, provided that the value of T is not too small to allow acquisition of frequency and timing. It is recommended that a measurement interval of T > 1.0 ms be used to obtain reliable performance. Other combinations of N and T that will allow measurement errors for Δθ_i of less than ±10 mrad are presented in Fig. 15. As pointed out above, if Δθ_i is measured to the accuracy required, then the accuracy requirement for Δτ_i will also be achieved. We wish to emphasize that the accuracy of the measurements of Δτ_i and Δθ_i depends on the waveform quality and the percentage of power in the code channel being measured. The curves in Fig. 15 represent a worst-case situation in which the waveform quality is ρ = 0.912 for all code channels and only 4.71% of the transmitter power is contained in the code channel being measured. For other test models, the lower bounds on NT can be obtained following the example given above and, for larger values of ρ, would be significantly lower than those given in Fig. 15.

Fig. 15. Lower bounds on NT for Δθ_i measurement errors less than ±10 milliradians for various confidence levels. (Horizontal axis: Number of Averages N.)
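Following the example above, a lower bound on NT for another test model can be obtained by solving equations 64, 65, and 70 for NT. The sketch below does this under the Gaussian assumption used for equation 71. The function name and arguments are hypothetical, the pilot fraction defaults to the 0.2 of the nominal test model, and the result is not guaranteed to reproduce the curves of Fig. 15 exactly.

```python
from statistics import NormalDist

B_KHZ = 615.0  # bandwidth of the baseband transmit signal, from equation 64


def min_nt_ms(rho: float, channel_fraction: float, tolerance_mrad: float,
              confidence: float, pilot_fraction: float = 0.2) -> float:
    """Smallest product N*T (in ms) for which the phase offset error of a code
    channel stays within +/- tolerance_mrad at the given two-sided confidence."""
    # Two-sided Gaussian multiplier, about 2.58 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    sigma_max_rad = tolerance_mrad * 1e-3 / z
    # Sum of the effective evm^2 terms of the code channel and the pilot.
    evm_sq_sum = (1.0 / rho - 1.0) * (1.0 / channel_fraction + 1.0 / pilot_fraction)
    # sigma^2 = evm_sq_sum / (4 * B * NT)  =>  NT = evm_sq_sum / (4 * B * sigma^2)
    return evm_sq_sum / (4.0 * B_KHZ * sigma_max_rad ** 2)


if __name__ == "__main__":
    for confidence in (0.90, 0.95, 0.99):
        nt = min_nt_ms(0.912, 0.0471, 10.0, confidence)
        print(f"{confidence:.0%} confidence: NT >= {nt:.1f} ms")
```

For the worst-case sync channel at 99% confidence this gives a bound close to the 34 × 2.0 ms = 68 ms used in the example above.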

Accuracy in Measuring τ_0 and Δω. The accuracy in measuring ρ and ρ_i is primarily dependent on the accuracy of the estimates of τ_0 and Δω as shown in Figs. 13 and 14. If τ_0 and Δω were obtained precisely, then the magnitude of the errors in the values obtained for ρ and ρ_i would be less than 10^-5, which is well within the accuracy specified for the HP 83203B.

The best accuracy for the estimates of τ_0 and Δω is obtained when the full parameter estimator is employed to estimate the time and phase offsets of code channels. In this case, τ_i and θ_i are determined for all active code channels and the estimate of Δω is obtained jointly with the estimates of τ_i and θ_i.

The next best accuracy for the estimates of τ_0 and Δω is obtained by using a reference signal synthesized as the sum of the reference signals for all active code channels, as is done for the full parameter estimator, but with the time and phase offsets set equal to zero in the parameter estimator. This procedure reduces the search for phase and timing from a 2K-dimensional problem, where K is the number of active code channels, to a 2-dimensional problem.

The accuracy of the estimates of τ_0 and Δω was determined through simulations in which the nominal signal model was used with random time and phase offsets introduced to the code channels and a measurement interval of 1.09 ms. Timing and phase offsets that were uniformly distributed over a range of ±50 ns for time offsets and ±50 mrad for phase offsets were introduced. The results of these simulations are presented in Figs. 16 and 17, which show the rms errors of the estimates of τ_0 and Δω, respectively, as functions of ρ. From Fig. 16, we see that the estimates of τ_0 obtained from the 2-dimensional parameter estimator are nearly as accurate as those obtained from the full 2K-dimensional parameter estimator. On the other hand, we see from Fig. 17 that the full parameter estimator provides roughly a factor of two less error in estimating frequency compared to the 2-dimensional parameter estimator. These curves show that there is little advantage in using the full parameter estimator unless time and phase offsets are outputs of the measurement. Therefore, the second method of obtaining estimates of τ_0 and Δω is recommended when measuring code-domain power without measuring time and phase offsets. A mode in the HP 83203B referred to as "Accurate Code-Domain Power" employs this second method of obtaining estimates of τ_0 and Δω.

Fig. 16. Rms error of the estimate of τ_0 as a function of signal quality ρ, determined through simulations in which the nominal signal model was used with random time offsets of up to ±50 ns and phase offsets of up to ±50 mrad introduced to the code channels and a measurement interval of 1.09 ms. (Curves: 2-Dimensional Parameter Estimator and 2K-Dimensional Parameter Estimator. Horizontal axis: Waveform Quality Factor ρ.)

The third method for obtaining estimates of τ_0 and Δω uses a reference signal consisting of only the pilot signal. This mode is referred to as "Fast Code-Domain Power" in the HP 83203B. If only the pilot channel is transmitted, then this mode is as accurate as the other two and is appropriate for measuring code-domain power. Moreover, if τ_0 and Δω are known a priori, then the "Fast Code-Domain Power" mode should be used.

Fig. 17. Rms error of the estimate of Δω as a function of signal quality ρ, determined through simulations using the same signal model as for Fig. 16. (Horizontal axis: Waveform Quality Factor ρ.)

Presented in Fig. 18 are curves obtained from simulations showing the rms error in estimating τ_0 and Δω for the case in which only the pilot channel is transmitted and a measurement interval of 1.09 ms is used. Curiously, these curves show that the timing errors in ns and the frequency errors in Hz are nearly identical. If we assume that the measurement errors are Gaussian, then we can obtain the 99% confidence limits for the measurement of τ_0 and Δω by multiplying the rms values given in Fig. 18 by a factor of 2.57. To obtain the measurement error of less than ±10 ns for τ_0 and less than ±10 Hz for Δω as specified in Table XIV with a confidence of 99%, the rms errors in measuring τ_0 and Δω must be less than 3.9 ns for τ_0 and less than 3.9 Hz for Δω. From Fig. 18, we see that τ_0 and Δω can be estimated to sufficient accuracy for 0.85 < ρ < 1.0 using a measurement interval of 1.09 ms. This exceeds the range of 0.9 < ρ < 1.0 specified in Table XII.

Referring to the performance curves in Figs. 16 and 17, we see that if ρ is less than approximately 0.97, then the performance given by these curves may not be adequate. If it is necessary to obtain better estimates of τ_0 and Δω than those given in Figs. 16, 17, and 18, then it will be necessary to use a longer measurement interval than the 1.09 ms considered here, or to average estimates obtained from independent time records, as is done for the time and phase offset measurements. As for the time and phase offset estimates, the rms errors of the estimates of τ_0 and Δω are proportional to 1/√(NT).

Fig. 18. Curves obtained from simulations showing the rms error in estimating τ_0 and Δω for the case in which only the pilot channel is transmitted and a measurement interval of 1.09 ms is used. (Horizontal axis: Waveform Quality Factor ρ.)

Measurement Examples

Typical results obtained with the HP 8921A cell site test set using the HP 83203B measurement algorithms are presented in Figs. 19 and 20. These results are not intended to validate any particular base station, but are presented only to illustrate actual measurements obtained using the algorithms discussed in this paper. The results presented in Fig. 19 were obtained from a base station transmitter in which the pilot, paging channel 1, sync channel 32, and one full-rate traffic channel 11 were active. From Fig. 19a, we see that the floor of the code-domain power is at approximately -38 dB relative to the total transmitter power, which corresponds to a relative error energy level of -38 dB + 18 dB = -20 dB. The factor of 18 dB corresponds to the distribution of energy to 64 code channels. The floor level of -38 dB corresponds to a value of ρ approximately equal to:

$$\rho = \frac{1}{1 + 10^{-2.0}} = 0.9901. \qquad (72)$$



The value of ρ measured was 0.9882. From the measured value of ρ we can calculate the approximate value of the floor level of the code-domain spectrum as:

$$\text{Floor Level} \approx 10\log_{10}(1/\rho - 1) - 18\ \text{dB}, \qquad (73)$$

which evaluates to approximately -37.2 dB and agrees closely with the floor level we see in Fig. 19a.
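The relationship between the code-domain floor and ρ in equations 72 and 73 can be evaluated in both directions. The following is a minimal sketch with hypothetical function names; it uses the rounded 18-dB spreading factor quoted in the text rather than the exact value of 10 log10(64).

```python
import math

SPREAD_DB = 18.0  # error energy spread uniformly over 64 code channels, as rounded in the text


def floor_level_db(rho: float) -> float:
    """Equation 73: approximate code-domain floor for a given waveform quality."""
    return 10.0 * math.log10(1.0 / rho - 1.0) - SPREAD_DB


def rho_from_floor_db(floor_db: float) -> float:
    """Equation 72: approximate waveform quality for a given code-domain floor."""
    relative_error_energy = 10.0 ** ((floor_db + SPREAD_DB) / 10.0)
    return 1.0 / (1.0 + relative_error_energy)


if __name__ == "__main__":
    print(f"rho for a -38 dB floor:  {rho_from_floor_db(-38.0):.4f}")
    print(f"floor for rho = 0.9882:  {floor_level_db(0.9882):.1f} dB")
    print(f"rho for a -45 dB floor:  {rho_from_floor_db(-45.0):.4f}")
```

The last line corresponds to the earlier observation that a floor at the 45-dB dynamic range limit requires a test signal with ρ of about 0.998.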

From the plot of code-domain power in Fig. 19a, we see that code channel 33 is significantly above the floor, even though code channel 33 was not active. This is an indication that the active code channels were leaking energy into code channel 33. It should be pointed out that the base station was overdriven during this measurement, which could be seen from a measurement of the spectrum of the transmitted signal. The plot of the measured spectrum is not included in this paper.

Measurements of time offsets and phase offsets obtained for a measurement interval of 1.25 ms are presented in Figs. 19b and 19c. For these measurements no averaging was used; therefore, the value of NT to use in equation 64 to determine the accuracy of the measurement is NT = 1.25 ms. The channel with the smallest energy level was the sync channel 32 for which the relative measured energy level was -12.8 dB. This corresponds to 5.25% of the energy in the sync channel. By using equation 65 with ρ = 0.9882, we obtain an effective evm² for the sync channel of:

$$\text{evm}^2 = \frac{1/\rho - 1}{0.0525} = 0.227. \qquad (74)$$



Using this value in equation 64, we obtain for the rms error of the estimate of the phase of the sync channel:

$$\sigma_{\theta\,\text{sync}} = \frac{1}{2}\sqrt{\frac{0.227}{(615)(1.25)}} = 8.6\ \text{mrad}. \qquad (75)$$



The relative power in the pilot channel was -11.1 dB, which corresponds to 7.73% of the total energy in the pilot. By following the above procedure for the pilot channel, we obtain the rms error for the estimate of the phase of the pilot channel:

$$\sigma_{\theta\,\text{pilot}} = 7.3\ \text{mrad}. \qquad (76)$$



Using the rms errors obtained above in equation 70, we obtain the rms error in the measurement of the phase offset of the sync channel:

$$\sigma_{\Delta\theta\,\text{sync}} = \sqrt{8.6^2 + 7.3^2} = 11.3\ \text{mrad}, \qquad (77)$$

and by using the Gaussian assumption used for equation 71 we obtain:

$$99\%\ \text{confidence interval} = \pm 2.57\,\sigma_{\Delta\theta\,\text{sync}} = \pm 29\ \text{mrad}. \qquad (78)$$

Thus, from the results of the simulations discussed previously, we can expect a 99% confidence interval for the measurement of time offset of approximately ±14.5 ns.

Fig. 19. Results of code-domain measurements of a base station transmitter with the pilot (0), paging channel (1), sync channel (32), and one full-rate traffic channel (11) active. (a) Code-domain power measurements. (b) Time offset measurements. (c) Phase offset measurements.

From Fig. 19b, we see that the measured time offsets are within the ±50-ns tolerance given in the IS-97 standard, with the worst-case 17-ns time offset occurring for the paging channel. The time offset specification is satisfied even if we include the ±14.5-ns confidence interval. From Fig. 19c, we see that the phase offsets for the sync channel and the traffic channel are well within the ±50-mrad tolerance given by the standard. However, the measured phase offset for the traffic channel was 91.8 mrad, which is outside the tolerance specified by the standard.

For the time and phase offset measurements presented here, the confidence intervals for the measurements were larger than could be used for valid tests. As discussed in the section on accuracy above, to obtain acceptable measurement accuracy it is necessary to average estimates of time and phase offsets. For the measurement situation of Fig. 19, acceptable measurement accuracy would have been achieved by averaging nine estimates to reduce the measurement confidence intervals by a factor of 3.
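The factor of 3 follows from the 1/√N dependence of the rms error on the number of independent estimates averaged. A minimal sketch of that scaling, using the ±29-mrad and ±14.5-ns single-measurement intervals obtained above, is given below; the function names are hypothetical and the errors of the individual estimates are assumed to be independent.

```python
import math


def averaged_interval(single_interval: float, n_averages: int) -> float:
    """Confidence interval after averaging n independent estimates."""
    return single_interval / math.sqrt(n_averages)


def min_averages(single_interval: float, target_interval: float) -> int:
    """Smallest number of independent estimates that meets the target interval."""
    return math.ceil((single_interval / target_interval) ** 2)


if __name__ == "__main__":
    print(f"phase: ±{averaged_interval(29.0, 9):.1f} mrad after 9 averages")
    print(f"time:  ±{averaged_interval(14.5, 9):.1f} ns after 9 averages")
    print(f"averages needed for ±10 mrad: {min_averages(29.0, 10.0)}")
```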

The results of the code-domain measurements of a base station transmitter in which four full-rate code channels 5, 6, 7, and 8 are active are presented in Fig. 20. In this case, we see that a significant amount of energy is leaked to inactive code channels. From Figs. 20b and 20c, we see that the largest time offset and phase offset are -15.6 ns and 69 mrad, respectively, for the sync channel. For these results, a single measurement interval of 1.25 ms was used, which results in large measurement confidence intervals.

Fig. 20. Results of code-domain measurements of a base station transmitter with the pilot (channel 0), paging channel (1), sync channel (32), and four full-rate traffic channels (5, 6, 7, 8) active. (a) Code-domain power measurements. (b) Time offset measurements. (c) Phase offset measurements.



Acknowledgments

The author wishes to acknowledge Marcus DaSilva, who led the HP 83203B project, and thank him for his continual support and enthusiasm during the development of the CDMA measurement algorithms, for encouraging the writing of this paper, and for carefully reviewing an early draft, including the meticulous checking of the numbers presented in the examples. Thanks also to Dave Whipple, who served on the IS-95 and IS-97 standards committees and contributed to the concepts of measuring waveform quality factor ρ and code-domain power ρ_i, for his support during the development of the CDMA measurement algorithms. Special thanks to Michael McNamee, who developed the data acquisition subsystem for the HP 83203B and ported the C code for the measurement algorithms from the development platform to the HP 83203B signal processing platform. Special thanks also to Tom Yeager, who performed field tests and demonstrations and obtained the bitmaps of the measurement results printed in this paper. The author also wishes to thank Michael McNamee, Tom Yeager, Dave Hoover, and Matt Hunton for reviewing this paper and providing some helpful suggestions.

Reference

D.P. Whipple, "North American Cellular CDMA," Hewlett-Packard Journal, Vol. 44, no. 6, December 1993, pp. 90-97.






Authors

February 1996

8 Multiprocessing Workstations and Servers

Matt J. Harline

An engineer/scientist at the Systems Technology Division, Matt Harline joined HP in 1984 at HP Laboratories. He was initially responsible for computer systems design including floating-point coprocessors, systems performance analysis, remote supportability, and test design. In addition to system design, he also began working on the electrical characterization of computer systems and on bring-up and debug methodologies and tools for current and future generations of systems. Before coming to HP, he did residential telephone design work at AT&T Bell Laboratories. Professionally interested in design for testability and debug, he's written two papers on modeling cross-coupled transmission lines and electrical testing. Matt received his BSEE degree in 1981 from Brigham Young University and his MSEE degree in 1982 from Purdue University. He is a member of the IEEE. Born in Walnut Creek, California, he is married, has three children, and is a scoutmaster for a Boy Scout troop. He enjoys outdoor activities, especially backpacking and camping. He also likes drawing and watercolor painting.

Brendan A. Voge

Brendan Voge is a design engineer at the General Systems Division and is currently responsible for VLSI design. Previously he worked as a design engineer and a systems engineer on various HP 3000 computer systems and embedded peripherals. He is named as a coinventor in two patents related to the Runway bus and I/O address translation mechanisms. He received a BSEE degree in 1982 from California Polytechnic State University at San Luis Obispo and an MSCS degree in 1984 from the University of California at Berkeley. While attending school, he worked as a technician and instructor. He joined HP's Computer Systems Group after graduation in 1984. Brendan was born in Berkeley, California. He is married. He likes outdoor activities such as windsurfing, mountain biking, skiing, and mountaineering.





Loren P. Staley

Loren Staley joined HP's Computer Systems Group in 1981. Since that time, he has done research and development on HP 3000 low-end systems and on all of the low-end and midrange HP 3000 and HP 9000 PA-RISC systems introduced since 1986. He is now an engineer/scientist at the Systems Technology Division and is responsible for the overall system performance of the HP 9000 K-class products including hardware/software synergy, hardware resource utilization, and software optimization. Before coming to HP, he did research and development at Amdahl Corporation and Rockwell International. He is professionally interested in systems architecture and design for performance. He authored a paper on the HP 9000 K-class technology. Loren was awarded a BSEE in 1977 and MSEE in 1978, both from Stanford University. He is married and has three children. He enjoys outdoor activities in the Sierra Nevada, such as skiing and backpacking.

Badir M. Mousa 

A senior industrial designer in the general systems
laboratory at the Systems Technology Division, Bud Mousa
is responsible for the industrial and mechanical design
of HP 9000 and HP 3000 servers and workstations,
including peripherals such as the uninterruptible power
supplies. His career at HP as an industrial designer has
focused on the design of many computer products and
peripherals including terminals, monitors, PC products,
digitizing tablets, computer furniture, and cabinets for
HP PA-RISC systems. He is named as an inventor in three
industrial design patents involving computer products. He
authored several papers on the HP 2700 color graphics
terminal. Bud received a BS degree in industrial design
with a minor in art from San Jose State University in
1975. He joined HP's computer group industrial design
department in 1977. Before coming to HP, he worked in
three industrial design consulting offices designing
television sets, high-fidelity equipment, medical
equipment, household appliances, toys, photography
equipment, and electronic equipment for the manufacturing
environment. Born in Ramallah, Palestine, Bud is married
and has three children. He is active in the visiting
scientist program and his hobbies include woodworking,
backpacking, camping, traveling, and canoeing. He also
helps coach his son's Little League baseball team.





18 Multiprocessor Bus



William R. Bryg

Bill Bryg was awarded BSEE and MSEE degrees from Stanford
University in 1979. After graduating, he joined the
General Systems Division at HP. He is currently an
engineer/scientist at the Systems Technology Division.
Since coming to HP, he has contributed to the HP 3000
Series 64, PA-RISC architecture, HP 9000 Model 840, and
HP 3000 Series 930. He worked on the HP-UX port to
PA-RISC and on the HP 9000 Model T500 processor-memory
bus definition. He led the protocol definition team and
the verification effort for the HP PA 7200 processor,
which operates on the Runway bus. He is currently
responsible for the system design and architecture of
future systems. His work has resulted in nine patents in
architecture, cache design, and memory design. He has
coauthored three articles on PA-RISC, the HP 9000 Model
840, and the HP PA 7200 chip. Born in Chicago, Illinois,
Bill is married and likes folk dancing, science fiction,
and cooking.

Kenneth K. Chan 

Ken Chan is an engineer/scientist at the Systems
Technology Division. He joined HP in 1980 at the General
Systems Division. He has worked on the design
verification of the HP 9000 Model 850 and the logic
design of the TLB controller for the HP 9000 Model 855.
He co-led the development of the processor interface chip
and was the project lead for the bus converter for the HP
9000 Model T500. He was also the project lead for the PA
7200 bus interface and participated in the Runway bus
definition. He is currently working on the memory
controller design for the next generation of servers. He
earned a BSEE degree from the University of California at
Berkeley in 1980 and an MSEE degree from Stanford
University in 1983. He is professionally interested in
multiprocessor bus protocols and is named as an inventor
in a related patent. He has coauthored two papers on the
PA 7200 and the multiprocessor bus of the HP 9000 Model
T500. Ken was born in Hong Kong. He is married and his
hobbies include bicycling and woodworking.







Nicholas S. Fiduccia

Born in [illegible], Illinois, Nick Fiduccia earned a
BSEE degree in 1979 and an MSEE degree in [illegible]. He
is an engineer at the Systems Technology Division. Since
coming to HP, he has been responsible for the design of
several processors, memory controllers, and bus
converters on several HP 3000 and 9000 systems. Most
recently, he designed the electrical transfer methodology
for the Runway bus in the HP 9000 Model T500 and
developed the driver and receiver circuits, which
interface to the bus. He is currently working on a bus
interface chip for the next T500 upgrade. Nick's favorite
hobby for the last sixteen years is ballroom dancing. He
also plays folk guitar and is learning classical piano.

25 PA 7200 CPU Design

Kenneth K. Chan

Author's biography appears elsewhere in this section.

Cyrus C. Hay

Cy Hay joined HP's Information Hardware Organization in
1986 after graduating with a BSEE degree from the
University of Cincinnati. As a design engineer, he has
worked on logic synthesis development, performance
monitor design, electrical debugging, characterization,
circuit design, and test generation. He recently was
responsible for the serial test generation,
multiprocessor verification, and circuit and artwork
translation for the PA 7200 CPU. He has authored or
coauthored several HP conference papers. Cy was born in
Shelby, Ohio. His outside interests are varied and
include alpine skiing, bicycling, traveling, jazz music,
cooking, and wine. He is currently employed at Sunrise
Test Systems.

John R. Keller

Born in Sheboygan, Wisconsin, John Keller earned a BSEE
degree from the University of Wisconsin at Madison in
1981. He went on to earn an MSEE degree from the
University of California at Berkeley in 1985. He joined
HP in [illegible] at the Corvallis Division and is
currently a lead engineer at the Systems Technology
Division. His contributions at HP include working on IC
process engineering, SRAM design, and IC design for
computer systems. He was responsible for the CMOS-14
circuit translation and for electrical characterization
and debugging of the PA 7200 CPU. He has authored or
coauthored twelve papers on circuit design and process
development. John was married this past October and loves
to travel.

Gordon P. Kurpanek

An engineer/scientist at the Systems Technology Division,
Gordon Kurpanek is currently responsible for the
[illegible] microarchitecture [illegible] electrical
characterization of the PA 7200 CPU. He has also done
electrical characterization and functional verification
in phase-2 and phase-1 on two previous PA-RISC
processors. He is a coauthor of an IEEE paper on the PA
7200. Gordon received a BS degree in computer engineering
in 1988 from the University of California at San Diego
and an MSEE degree in 1991 from Stanford University under
the HP Fellowship program. He worked at HP during his
school years and joined HP's Systems Technology Division
full-time in 1988. He was born in Minneapolis, Minnesota
and enjoys backpacking, sailboat racing, tropical
islands, and other beautiful places.

Francis X. Schumacher

Born in Cupertino, California, Francis Schumacher
received a BSEE degree in 1984 and an MSEE degree in
1986, both from Stanford University. He joined the HP
Cupertino Integrated Circuits Operation in [illegible]
and is currently an engineer/scientist at the Systems
Technology Division. Since joining HP, he has worked on
chip design for a bus interface and two CPUs including
the PA 7200, and was responsible for the PA 7200 cache
I/O design. He is currently responsible for chip design
for the central electronics complex of next-generation
systems. His work has resulted in a patent on a
delay-locked loop circuit and he has coauthored two
papers on VLSI processors. Francis is married, has a
daughter, and likes to play soccer.

Jason Zheng 

Jason Zheng was born in Jiande, Zhejiang in the People's
Republic of China. He obtained a BSEE degree from the
Zhejiang University in 1982 and a PhD degree in
biomedical engineering from Vanderbilt University in
1986. He joined HP's Systems Technology Division in 1988.
As a member of the technical staff, he worked on the ALU
test development and phase-2 functional verification for
a pair of computer systems and more recently was
responsible for the design and verification of the PA
7200's superscalar logic and for its electrical
characterization and debugging. He is professionally
interested in the design, verification, electrical
characterization, and debugging of VLSI processors. He is
currently employed at Chromatic Research, Inc. Jason is
married, has one child, and enjoys reading and table
tennis.



34 PA 7200 Verification



Thomas B. Alexander

Tom Alexander is a project manager in the [illegible] at
the Systems Technology Division. He worked on the design
of the bus interface for the PA 7200 CPU and is currently
managing the PA 7200 electrical characterization project.
Previously, he worked on the design of the processor
interface chip, bus converter chip, and processor-memory
bus definition for the HP 9000 Model T500 computer
system. Before that, he worked on the design of the I/O
system of the HP 9000 Model 840. He is named as an
inventor in a patent on a bus protocol and has coauthored
a paper on the multiprocessor features of HP corporate
business servers. He received a BSEE degree in 1981 from
Purdue University and an MSEE degree in 1990 from
Stanford University. He joined HP's Data Systems Division
in 1982. Tom was born in Anderson, Indiana. He is married
and enjoys woodworking, metal working, and playing
basketball.

Kent A. Dickey

Kent Dickey graduated from Princeton University in 1991
with a BS degree in electrical engineering. He then
joined the HP Systems Technology Division as a hardware
design engineer. He has worked on verification of the HP
9000 Model 890 processor board, Model T500 system, and
Model 170 system. He also verified the HP 9000 J-class
workstations and K-class servers. He is currently
responsible for verification of the next-generation
system. Kent is professionally interested in verification
and logic design and coauthored a paper about an HP
verification tool. He was born in Morristown, New Jersey.
He is married and his wife also works at HP.

David N. Goldberg

An engineer at the Systems Technology Division, Dave
Goldberg is currently the technical lead for the physical
design methodologies and data path design for two VLSI
chips to be used in the next generation of PA-RISC
corporate business servers. Recently, he's been
responsible for the PA 7200 presilicon verification
methodology and simulation environment, developing key
simulation tools for the verification team. He has worked
on VLSI components for five generations of HP PA-RISC
computer products, the PA 7200 CPU, two multiprocessor
system interface chips, two NMOS floating-point
coprocessors, and an I/O bus converter. He is
professionally interested in VLSI CAD tools and has
coauthored HP conference papers and IEEE publications. He
received a BS degree in 1979 from Southern Illinois
University and then worked at AT&T Bell Laboratories
designing









ASICs for computer and electronic switching systems. He
received his MSEE degree in 1985 from the University of
Wisconsin. He joined the Information Hardware
Organization of HP's Information Technology Group in
1985. Dave was born in Chicago, Illinois. He has a son
and volunteers his time at his son's primary school. He
also enjoys outdoor activities such as camping, hiking,
whitewater rafting, and scuba diving, and is a
self-proclaimed former baseball fan.

Ross V. La Fetra

Ross La Fetra was awarded a BS degree in 1984 and an MS
degree in 1985, both in general engineering from Harvey
Mudd College. After graduating, he joined HP's
Information Technology Group. He is now a member of the
technical staff at the Video Communications Division and
is responsible for content preparation for video on
demand. Previously he worked on the postdesign
testability of the PA 7200 CPU. Before that, he worked on
electrical verification for HP 3000 Series 990 and cache
memory design and on electrical verification of the HP
3000 Series 990. His work has resulted in three patents
on VLSI test, multilevel cache design, and a smart cache
protocol. He has coauthored a paper on the processor
design of the HP 3000 Series 990. Born in Los Angeles,
California, Ross is married and is expecting his first
child. His hobbies include hiking, backpacking,
bicycling, gardening, and folk dancing.

James R. McGee 



Jim McGee was awarded a BS degree in electrical
engineering and a BS degree in physics, both in 1981 from
Iowa State University of Science and Technology. He went
on to earn an MSEE degree in 1985 from the University of
Minnesota. He joined HP in 1986 at the Systems Technology
Division. He has contributed to processor and computer
system verification including the presilicon and
postsilicon verification of the PA 7200 CPU and the HP
9000 J-class workstations and K-class servers. He is
currently responsible for memory controller control logic
design. Jim was born in Ames, Iowa. He is married and is
an occasional backpacker and tinkerer.

Nazeem Noordeen

A project manager at the Systems Technology Division,
Nazeem Noordeen is responsible for managing the testing
and debugging of the PA 7200 CPU. He joined HP in 1988
after graduating with an MS degree in computer
engineering from Syracuse University in 1988. Two years
previously, he earned a BSEE degree from Pennsylvania
State University. Since coming to HP, he has worked on
several VLSI chips and high-end systems including the HP
9000 Series 800 and the PA 7700 CPU. His responsibilities
included verification, design, test, and debugging. He





also designed a block that provides superscalar
functionality. Nazeem has coauthored an article about the
multiprocessor features of corporate business servers.
He was born in Madras, India and is married. His
hobbies include tennis, golf, and gardening.

Akshya Prakash 

Akshya Prakash received a BSEE degree in 1983 from the
Indian Institute of Technology in Bombay. He then earned
an MSEE degree in 1983 from the University of Texas in
Austin. After graduating, he joined the Systems
Technology Division and worked on VLSI design for several
generations of HP PA-RISC computer projects. He was the
project manager for the multiprocessor functional
verification of the PA 7100 CPU and for the electrical
characterization and debugging of the PA 7200 CPU. He is
currently responsible for managing the VLSI design and
development for HP's next generation of computer
products. He has coauthored several papers for HP
conferences. Born in Muzaffarnagar, India, Akshya is
married, has two children, and enjoys cricket, tennis,
and camping.

44 Memory System Design



Thomas R. Hotchkiss

A hardware product planning engineer, Tom Hotchkiss is
currently responsible for defining the requirements for
future computer systems at the Workstation Systems
Division. Previously, he was the lead engineer and
architect for the high-performance VLSI chipset for the
HP 9000 K-class memory subsystem. Before that, he was a
VLSI designer for a pair of HP PA-RISC CPU chipsets and
served as a design consultant for a Hitachi-designed
PA-RISC CPU. He has authored or coauthored two papers on
cache timing and TLB controller design. He received a BS
degree in electrical engineering and computer science in
1984 from the University of Connecticut. After
graduation, he joined HP's Systems Technology Division.
Born in Bridgeport, Connecticut, Tom is married, has a
son, and enjoys home brewing.

Norman D. Marschke

Born in St. Joseph, Michigan, Norm Marschke received a
BSEE degree in 1963 and an MSEE in 1964, both from the
University of Michigan. After graduating, he joined HP's
Frequency and Time Division. He is now a hardware design
engineer at the Systems Technology Division and is
responsible for the design of the processor PC board,
including the level-2 cache. Recently he worked on the
design of the SIMM boards for the HP 9000 K-class memory
system. Before that, he served as a design engineer or
project manager for HP 3000 computer families, the HP 300
computer, a Fourier digital signal analyzer,



a multichannel analyzer, and frequency counters. He
has coauthored papers on the HP 300 and the Fourier
analyzer. Norm enjoys downhill skiing, hiking, and
commuting to work daily on his bicycle.

Richard M. McClosky

A design engineer at the Systems Technology Division,
Rich McClosky is currently the design lead for physical
aspects of the next-generation VLSI I/O bus converter
chip for HP 9000 and HP 3000 enterprise servers. He
received a BSEE degree in 1978 and an MSEE degree in
1978, both from Auburn University. After graduating, he
joined HP's General Systems Division. He contributed to
the development of the floating-point microcode for HP
3000 Series 33 computers and did VLSI chip design for the
HP 3000 Series 50 and 60 memory subsystem controller and
for PA-RISC I/O and CPU chips. He contributed to the
memory subsystem design as well as the board design and
layout for the memory carrier board used in the HP 9000
J/K-class systems and HP 3000 Series 9x9KS servers. He is
a member of the NSPE and the IEEE and the coauthor of a
FORTH development system for PCs. Rich was born in Tampa,
Florida. He races a Formula Ford race car in SCCA at the
national level and is the former national champion of
NORRCA one-twelfth-scale radio-control model cars. His
other hobbies include woodworking, metalworking, fishing,
and hiking.

52 Hardware Cache Coherent I/O

Todd J. Kjos 

With HP's Open Systems Software Division since 1990,
Todd Kjos is currently a technical contributor working on
platform architecture and design of future HP-UX systems.
He was the technical lead for coherent I/O support in the
HP-UX operating system. He was also the technical lead
for the convergence of the HP 9000 Series 700 and 800 I/O
subsystems in HP-UX 10.0. Previously, he was the
technical lead for the HP-FL fiber link disk drive
interface. Before joining HP, he worked as a hardware
engineer at Raytheon Company and as a software engineer
at the Cambridge Technology Group. He was awarded a BS
degree in engineering physics in 1986 from Oregon State
University and an MS degree in computer engineering in
1990 from Boston University. Todd was born in Palo Alto,
California. He is married and enjoys traveling,
backpacking, skiing, and reading.



Helen Nusbaum 




A project manager at the Systems Technology Division,
Helen Nusbaum is responsible for VLSI system simulation
and formal verification. Previously she worked on VLSI
design and simulation of the I/O adapter for the HP






9000 J-class and K-class systems. Her work, along with
that of other authors in this issue, has resulted in
[illegible] pending patents related to cache
[illegible]. She was awarded a BSEE degree in [illegible]
from the University of California at Davis and
[illegible] from California State [illegible]. She was
born in [illegible]. She spends most of her free time
playing with her two-year-old son.

Michael K. Traynor

Mike Traynor was awarded a BS degree in computer science
in 1988 from the California Polytechnic State University,
San Luis Obispo. After graduating, he joined HP's
Information Networks Division. He is currently a
technical contributor responsible for cluster
performance. Previously, as a software engineer, he was
responsible for the development and support of network
drivers and firmware for the HP-UX, MPE/iX, and MPE V
operating systems. His most recent responsibilities in
this area involved porting and tuning network drivers for
the HP 9000 J/K-class systems. Mike is professionally
interested in data communications. He is a member of the
IEEE and the ACM. Born in Los Angeles, California, his
hobbies include photography, biking, and aquariums.

Brendan A. Voge 

Author's biography appears elsewhere in this section.

60 Fibre Channel Chips

Justin S. Chang 

A member of the technical staff at the Optical
Communications Division, Justin Chang is currently
responsible for the design of high-speed bipolar
integrated circuits including an eye-safe gigabit laser
driver. He joined HP in 1982 at the Roseville Networks
Division where he worked as a digital systems designer
developing LAN products. In 1989 he moved to the
Communications Components Division where he worked as an
IC designer designing wireless communication components.
He then joined the Optical Communications Division in
1992 as an IC designer responsible for designing
high-speed data communications components. For the fiber
channel chipset he worked on laser driver and safety
circuitry and chipset systems test. He is professionally
interested in ICs for data communications and has
coauthored two papers on mixed modulators and gigabit
fiber-optic transmission. He was awarded a BSEE degree in
1982 from the University of California at Berkeley and an
MSECE degree in 1989 from the University of California at
Santa Barbara. Justin was born in Taichung, Taiwan. He is
married and likes outdoor activities such as bicycling,
Rollerblading, and racquet sports.






Richard Dugan 

Richard Dugan is a [illegible] manager at the Optical
Communications Division, where he manages IC development
[illegible]. He received a BSEE degree [illegible] from
the University of California at Santa Barbara and an MSEE
degree in [illegible] from Stanford University. He joined
HP's Microwave Systems Division in 1982. He's contributed
to microwave [illegible] and has worked as an IC designer
and a design group manager. He is professionally
interested in high-speed ICs and data communication and
has coauthored two papers on mixers and modulators.
Richard was born in Pittsburgh, Pennsylvania. He has been
married for seventeen years and has three-year-old twins.
His hobbies include cycling, fishing, and cooking.

Benny W,H. Lai 

Born in Hong Kong, Benny Lai received a BSEE degree in
1982 and an MSEE degree in 1983, both from the University
of California at Berkeley. He joined HP's Microwave
Systems Division in 1981 and worked on device modeling
and simulation techniques, microwave amplifier design,
decision circuit design, and the G-link chipset. He then
transferred to the Optical Communications Division where
he is a principal member of the technical staff,
currently responsible for 672-Mbit/s clock data recovery
(CDR) postamplifier design and fibre channel arbitrated
loop IC design. For the fiber channel chipset he
contributed to the design of the transmitter and receiver
architecture and the phase-locked loop. He also worked on
the logic library and array designs. He is named as an
inventor in two patents on CDR architecture and the
G-link coding scheme. He has two patents pending on the
integrator and the loss-of-signal detector. He has
authored three papers on the CDR IC and the G-link IC
chipset. Benny is married and enjoys gardening and
landscaping.

Margaret M. Nakamoto 

Born in Honolulu, Hawaii, Margaret Nakamoto was awarded
a BSEE degree in 1989 from the University of Hawaii and
an MSEE degree in 1992 from Stanford University. She
joined HP's Microwave Systems Division in 1989 and
contributed to microwave IC characterization and GaAs IC
design. Then, as a member of the technical staff at the
Optical Communications Division, she contributed to
G-link IC characterization and test support and worked on
I/O cell design, chip verification and simulation, and
test and characterization support for the fiber channel
chipset. Currently she is responsible for fibre channel
arbitrated loop IC design. She has authored an IEEE paper
on GaAs IC design. Margaret is married and enjoys
bicycling, aerobics, and Rollerblading.



68 Hardware Code Inspection

Joseph J. Gilray

Joe Gilray is an R&D [illegible] and simulation
[illegible] the development and support of ASIC design
methodologies. He joined HP in 1984 at the Logic Systems
Division. He initially worked as a member of the
technical staff building and supporting the CAEE
software that links the HP Design Capture System to the
system GenRad HILO simulators. During this time, he
authored a paper on the integration of software and
hardware simulation. He was the process manager for the
HDL code inspection process and often acted as moderator
for individual code inspections. Before coming to HP, he
worked at Compion Systems as a hardware and software
system designer. He is professionally interested in
object-oriented programming in C++. He was awarded a BSEE
degree from the University of Illinois at
Champaign-Urbana. Joe was born in Waukegan, Illinois. He
is married and has a son and daughter. He likes bicycling
and plays city league basketball on an HP team called
"The Vertically Challenged."

73 Code-Domain Measurements



Raymond A. Birgenheier

Ray Birgenheier has been a consultant and development
engineer in digital signal processing and digital
communications at HP's Spokane Division since 1981. He
contributed to the standards on modulation accuracy
prepared by two subcommittees of the Telecommunications
Industry Association and pioneered the development of
techniques and algorithms for measuring modulation
accuracy and code-domain power of cellular radio
transmitters. He is named as an inventor in two patents
on premodulation filters and two on modulation
measurement techniques and apparatus. He developed the
modulation measurement algorithms for the HP 11847A, HP
83203A, HP 83203B, and HP 8924C measurement systems,
which verify the RF performance of TDMA and CDMA digital
cellular transmitters. He received a BSEE degree in 1963
from Montana State University, an MSEE degree in 1965
from the University of Southern California, and a PhD
degree in 1972 in electrical engineering from the
University of California at Los Angeles. He worked for
Hughes Aircraft Company's Radar Systems Division from
1963 to 1980, where he became a senior scientist in 1976.
Since 1980, he has served as a professor and chairman of
the Department of Electrical Engineering at Gonzaga
University. He is a member of the IEEE societies on
communication systems and engineering education. A U.S.
Navy veteran, Ray was born in Billings, Montana. He is
married and has seven children and six grandchildren. He
is active in his church and enjoys hiking, hunting, and
fishing.







